MATHEMATICAL METHODS 
OF STATISTICS 


By 

HARALD CRAMÉR

PROFESSOR IN THE UNIVERSITY
OF STOCKHOLM



ASIA PUBLISHING HOUSE 

BOMBAY CALCUTTA - NEW DELHI MADRAS 
LONDON • NEW YORK 




Printed in the United States in 1946

Second Printing: 1947
Third Printing: 1947
Fourth Printing: 1949
Fifth Printing: 1951
Sixth Printing: 1954
Seventh Printing: 1957
Eighth Printing: 1958
Ninth Printing: 1961

First Indian Edition: 1962
Under agreement with the
PRINCETON UNIVERSITY PRESS


PRINTED BY M. E. EAPEN AT THE GLS PRESS, BOMBAY,
AND PUBLISHED BY P. S. JAYASINGHE,
ASIA PUBLISHING HOUSE, BOMBAY



MARTA 




PREFACE. 


During the last 25 years, statistical science has made great
progress, thanks to the brilliant schools of British and American
statisticians, among whom the name of Professor R. A. Fisher should be
mentioned in the foremost place. During the same time, largely owing
to the work of French and Russian mathematicians, the classical
calculus of probability has developed into a purely mathematical theory
satisfying modern standards with respect to rigour.

The purpose of the present work is to join these two lines of
development in an exposition of the mathematical theory of modern
statistical methods, in so far as these are based on the concept of
probability. A full understanding of the theory of these methods
requires a fairly advanced knowledge of pure mathematics. In this
respect, I have tried to make the book self-contained from the point
of view of a reader possessing a good working knowledge of the
elements of the differential and integral calculus, algebra, and analytic
geometry.

In the first part of the book, which serves as a mathematical
introduction, the requisite mathematics not assumed to be previously
known to the reader are developed. Particular stress has been laid
on the fundamental concepts of a distribution, and of the integration
with respect to a distribution. As a preliminary to the introduction
of these concepts, the theory of Lebesgue measure and integration
has been briefly developed in Chapters 4–5, and the fundamental
concepts are then introduced by straightforward generalization in
Chapters 6–7.

The second part of the book contains the general theory of random
variables and probability distributions, while the third part is devoted
to the theory of sampling distributions, statistical estimation, and
tests of significance. The selection of the questions treated in the
last part is necessarily somewhat arbitrary, but I have tried to
concentrate in the first place on points of general importance. When
these are fully mastered, the reader will be able to work out
applications to particular problems for himself. In order to keep the
volume of the book within reasonable limits, it has been necessary to
exclude certain topics of great interest, which I had originally intended
to treat, such as the theory of random processes, statistical time
series and periodograms.

The theory of the statistical tests is illustrated by numerical
examples borrowed from various fields of application. Owing to
considerations of space, it has been necessary to reduce the number of
these examples rather severely. It has also been necessary to refrain
from every discussion of questions concerning the practical
arrangement of numerical calculations.

It is not necessary to go through the first part completely before
studying the rest of the book. A reader who is anxious to find himself
in medias res may content himself with making some slight
acquaintance with the fundamental concepts referred to above. For
this purpose, it will be advisable to read Chapters 1–3, and the
paragraphs 4.1–4.2, 5.1–5.3, 6.1–6.2, 6.4–6.6, 7.1–7.2, 7.4–7.5 and
8.1–8.4. The reader may then proceed to Chapter 13, and look up
the references to the first part as they occur.

The book is founded on my University lectures since about 1930,
and has been written mainly during the years 1942–1944. Owing to
war conditions, foreign scientific literature was during these years
only very incompletely and with considerable delay available in
Sweden, and this must serve as an excuse for the possible absence of
quotations which would otherwise have been appropriate.

The printing of the Scandinavian edition of the book has been
made possible by grants from the Royal Swedish Academy of Science,
and from Stiftelsen Lars Hiertas Minne. I express my gratitude
towards these institutions.

My thanks are also due to the Editors of the Princeton Mathematical
Series for their kind offer to include the book in the Series, and
for their permission to print a separate Scandinavian edition.

I am further indebted to Professor R. A. Fisher and to Messrs
Oliver and Boyd for permission to reprint tables of the $t$- and
$\chi^2$-distributions from »Statistical Methods for Research Workers».

A number of friends have rendered valuable help during the
preparation of the book. Professors Harald Bohr and Ernst Jacobsthal,
taking refuge in Sweden from the hardships of the times, have read
parts of the work in manuscript and in proof, and have given stimulating
criticism and advice. Professor Herman Wold has made a very careful
scrutiny of the whole work in proof, and I have greatly profited
from his valuable remarks. Gösta Almqvist, Jan Jung, Sven G. Lindblom
and Bertil Matérn have assisted in the numerical calculations,
the revision of the manuscript, and the reading of the proofs. To all
these I wish to express my sincere thanks.

Department of Mathematical Statistics 
University of Stockholm 
May 1945 
H. C. 




Table of Contents.


First Part.

MATHEMATICAL INTRODUCTION.

Chapters 1–3. Sets of Points.

                                                                  Page
Chapter 1. General properties of sets ........................... 3
   1. Sets. – 2. Subsets, space. – 3. Operations on sets. – 4. Sequences
   of sets. – 5. Monotone sequences. – 6. Additive classes of sets.

Chapter 2. Linear point sets .................................... 10
   1. Intervals. – 2. Various properties of sets in $R_1$. – 3. Borel sets.

Chapter 3. Point sets in n dimensions ........................... 15
   1. Intervals. – 2. Various properties of sets in $R_n$. – 3. Borel
   sets. – 4. Linear sets. – 5. Subspace, product space.

References to chapters 1–3 ...................................... 18


Chapters 4–7. Theory of Measure and Integration in $R_1$.

Chapter 4. The Lebesgue measure of a linear point set ........... 19
   1. Length of an interval. – 2. Generalization. – 3. The measure of a
   sum of intervals. – 4. Outer and inner measure of a bounded set. –
   5. Measurable sets and Lebesgue measure. – 6. The class of measurable
   sets. – 7. Measurable sets and Borel sets.

Chapter 5. The Lebesgue integral for functions of one variable .. 33
   1. The integral of a bounded function over a set of finite measure. –
   2. B-measurable functions. – 3. Properties of the integral. – 4. The
   integral of an unbounded function over a set of finite measure. –
   5. The integral over a set of infinite measure. – 6. The Lebesgue
   integral as an additive set function.

Chapter 6. Non-negative additive set functions in $R_1$ ......... 48
   1. Generalization of the Lebesgue measure and the Lebesgue integral. –
   2. Set functions and point functions. – 3. Construction of a set
   function. – 4. P-measure. – 5. Bounded set functions. – 6. Distributions.
   – 7. Sequences of distributions. – 8. A convergence theorem.

Chapter 7. The Lebesgue-Stieltjes integral for functions of one
   variable ..................................................... 62
   1. The integral of a bounded function over a set of finite P-measure.
   – 2. Unbounded functions and sets of infinite P-measure. –
   3. Lebesgue-Stieltjes integrals with a parameter. –
   4. Lebesgue-Stieltjes integrals with respect to a distribution. –
   5. The Riemann-Stieltjes integral.

References to chapters 4–7 ...................................... 75


Chapters 8–9. Theory of Measure and Integration in $R_n$.

Chapter 8. Lebesgue measure and other additive set functions
   in $R_n$ ..................................................... 76
   1. Lebesgue measure in $R_n$. – 2. Non-negative additive set functions
   in $R_n$. – 3. Bounded set functions. – 4. Distributions. –
   5. Sequences of distributions. – 6. Distributions in a product space.

Chapter 9. The Lebesgue-Stieltjes integral for functions of n
   variables .................................................... 85
   1. The Lebesgue-Stieltjes integral. – 2. Lebesgue-Stieltjes integrals
   with respect to a distribution. – 3. A theorem on repeated integrals.
   – 4. The Riemann-Stieltjes integral. – 5. The Schwarz inequality.


Chapters 10–12. Various Questions.

Chapter 10. Fourier integrals ................................... 89
   1. The characteristic function of a distribution in $R_1$. – 2. Some
   auxiliary functions. – 3. Uniqueness theorem for characteristic
   functions in $R_1$. – 4. Continuity theorem for characteristic
   functions in $R_1$. – 5. Some particular integrals. – 6. The
   characteristic function of a distribution in $R_n$. – 7. Continuity
   theorem for characteristic functions in $R_n$.

Chapter 11. Matrices, determinants and quadratic forms .......... 103
   1. Matrices. – 2. Vectors. – 3. Matrix notation for linear
   transformations. – 4. Matrix notation for bilinear and quadratic
   forms. – 5. Determinants. – 6. Rank. – 7. Adjugate and reciprocal
   matrices. – 8. Linear equations. – 9. Orthogonal matrices.
   Characteristic numbers. – 10. Non-negative quadratic forms. –
   11. Decomposition of $\sum x_i^2$. – 12. Some integral formulae.

Chapter 12. Miscellaneous complements ........................... 122
   1. The symbols $O$, $o$ and $\sim$. – 2. The Euler-MacLaurin sum
   formula. – 3. The Gamma function. – 4. The Beta function. –
   5. Stirling's formula. – 6. Orthogonal polynomials.
Second Part.

RANDOM VARIABLES AND PROBABILITY DISTRIBUTIONS.


Chapters 13–14. Foundations.

Chapter 13. Statistics and probability .......................... 137
   1. Random experiments. – 2. Examples. – 3. Statistical regularity. –
   4. Object of a mathematical theory. – 5. Mathematical probability.

Chapter 14. Fundamental definitions and axioms .................. 151
   1. Random variables. (Axioms 1–2.) – 2. Combined variables. (Axiom 3.)
   – 3. Conditional distributions. – 4. Independent variables. –
   5. Functions of random variables. – 6. Conclusion.


Chapters 15–20. Variables and Distributions in $R_1$.

Chapter 15. General properties .................................. 166
   1. Distribution function and frequency function. – 2. Two simple types
   of distributions. – 3. Mean values. – 4. Moments. – 5. Measures of
   location. – 6. Measures of dispersion. – 7. Tchebycheff's theorem. –
   8. Measures of skewness and excess. – 9. Characteristic functions. –
   10. Semi-invariants. – 11. Independent variables. – 12. Addition of
   independent variables.

Chapter 16. Various discrete distributions ...................... 192
   1. The function $\varepsilon(x)$. – 2. The binomial distribution. –
   3. Bernoulli's theorem. – 4. De Moivre's theorem. – 5. The Poisson
   distribution. – 6. The generalized binomial distribution of Poisson.

Chapter 17. The normal distribution ............................. 208
   1. The normal functions. – 2. The normal distribution. – 3. Addition
   of independent normal variables. – 4. The central limit theorem. –
   5. Complementary remarks to the central limit theorem. –
   6. Orthogonal expansion derived from the normal distribution. –
   7. Asymptotic expansion derived from the normal distribution. –
   8. The rôle of the normal distribution in statistics.

Chapter 18. Various distributions related to the normal ......... 233
   1. The $\chi^2$-distribution. – 2. Student's distribution. –
   3. Fisher's z-distribution. – 4. The Beta distribution.

Chapter 19. Further continuous distributions .................... 244
   1. The rectangular distribution. – 2. Cauchy's and Laplace's
   distributions. – 3. Truncated distributions. – 4. The Pearson system.

Chapter 20. Some convergence theorems ........................... 250
   1. Convergence of distributions and variables. – 2. Convergence of
   certain distributions to the normal. – 3. Convergence in probability.
   – 4. Tchebycheff's theorem. – 5. Khintchine's theorem. –
   6. A convergence theorem.

Exercises to chapters 15–20 ..................................... 255


Chapters 21–24. Variables and Distributions in $R_n$.

Chapter 21. The two-dimensional case ............................ 260
   1. Two simple types of distributions. – 2. Mean values, moments. –
   3. Characteristic functions. – 4. Conditional distributions. –
   5. Regression, I. – 6. Regression, II. – 7. The correlation
   coefficient. – 8. Linear transformation of variables. – 9. The
   correlation ratio and the mean square contingency. – 10. The ellipse
   of concentration. – 11. Addition of independent variables. –
   12. The normal distribution.

Chapter 22. General properties of distributions in $R_n$ ........ 291
   1. Two simple types of distributions. Conditional distributions. –
   2. Change of variables in a continuous distribution. – 3. Mean
   values, moments. – 4. Characteristic functions. – 5. Rank of a
   distribution. – 6. Linear transformation of variables. – 7. The
   ellipsoid of concentration.

Chapter 23. Regression and correlation in n variables ........... 301
   1. Regression surfaces. – 2. Linear mean square regression. –
   3. Residuals. – 4. Partial correlation. – 5. The multiple correlation
   coefficient. – 6. Orthogonal mean square regression.

Chapter 24. The normal distribution ............................. 310
   1. The characteristic function. – 2. The non-singular normal
   distribution. – 3. The singular normal distribution. – 4. Linear
   transformation of normally distributed variables. – 5. Distribution
   of a sum of squares. – 6. Conditional distributions. – 7. Addition
   of independent variables. The central limit theorem.

Exercises to chapters 21–24 ..................................... 317


Third Part.

STATISTICAL INFERENCE.


Chapters 25–26. Generalities.

Chapter 25. Preliminary notions on sampling ..................... 323
   1. Introductory remarks. – 2. Simple random sampling. – 3. The
   distribution of the sample. – 4. The sample values as random
   variables. Sampling distributions. – 5. Statistical image of a
   distribution. – 6. Biased sampling. Random sampling numbers. –
   7. Sampling without replacement. The representative method.

Chapter 26. Statistical inference ............................... 332
   1. Introductory remarks. – 2. Agreement between theory and facts.
   Tests of significance. – 3. Description. – 4. Analysis. –
   5. Prediction.


Chapters 27–29. Sampling Distributions.

Chapter 27. Characteristics of sampling distributions ........... 341
   1. Notations. – 2. The sample mean $\bar{x}$. – 3. The moments
   $a_\nu$. – 4. The variance $m_2$. – 5. Higher central moments and
   semi-invariants. – 6. Unbiased estimates. – 7. Functions of moments.
   – 8. Characteristics of multi-dimensional distributions. –
   9. Corrections for grouping.

Chapter 28. Asymptotic properties of sampling distributions ..... 363
   1. Introductory remarks. – 2. The moments. – 3. The central moments.
   – 4. Functions of moments. – 5. The quantiles. – 6. The extreme
   values and the range.

Chapter 29. Exact sampling distributions ........................ 378
   1. The problem. – 2. Fisher's lemma. Degrees of freedom. – 3. The
   joint distribution of $\bar{x}$ and $s$ in samples from a normal
   distribution. – 4. Student's ratio. – 5. A lemma. – 6. Sampling from
   a two-dimensional normal distribution. – 7. The correlation
   coefficient. – 8. The regression coefficients. – 9. Sampling from a
   k-dimensional normal distribution. – 10. The generalized variance. –
   11. The generalized Student ratio. – 12. Regression coefficients. –
   13. Partial and multiple correlation coefficients.


Chapters 30–31. Tests of Significance, I.

Chapter 30. Tests of goodness of fit and allied tests ........... 416
   1. The $\chi^2$ test in the case of a completely specified
   hypothetical distribution. – 2. Examples. – 3. The $\chi^2$ test when
   certain parameters are estimated from the sample. – 4. Examples. –
   5. Contingency tables. – 6. $\chi^2$ as a test of homogeneity. –
   7. Criterion of differential death rates. – 8. Further tests of
   goodness of fit.

Chapter 31. Tests of significance for parameters ................ 452
   1. Tests based on standard errors. – 2. Tests based on exact
   distributions. – 3. Examples.


Chapters 32–34. Theory of Estimation.

Chapter 32. Classification of estimates ......................... 473
   1. The problem. – 2. Two lemmas. – 3. Minimum variance of an
   estimate. Efficient estimates. – 4. Sufficient estimates. –
   5. Asymptotically efficient estimates. – 6. The case of two unknown
   parameters. – 7. Several unknown parameters. – 8. Generalization.

Chapter 33. Methods of estimation ............................... 497
   1. The method of moments. – 2. The method of maximum likelihood. –
   3. Asymptotic properties of maximum likelihood estimates. – 4. The
   $\chi^2$ minimum method.

Chapter 34. Confidence regions .................................. 507
   1. Introductory remarks. – 2. A single unknown parameter. – 3. The
   general case. – 4. Examples.


Chapters 35–37. Tests of Significance, II.

Chapter 35. General theory of testing statistical hypotheses .... 525
   1. The choice of a test of significance. – 2. Simple and composite
   hypotheses. – 3. Tests of simple hypotheses. Most powerful tests. –
   4. Unbiased tests. – 5. Tests of composite hypotheses.

Chapter 36. Analysis of variance ................................ 536
   1. Variability of mean values. – 2. Simple grouping of variables. –
   3. Generalization. – 4. Randomized blocks. – 5. Latin squares.

Chapter 37. Some regression problems ............................ 548
   1. Problems involving non-random variables. – 2. Simple regression. –
   3. Multiple regression. – 4. Further regression problems.


Tables 1–2. The Normal Distribution ............................. 557

Table 3. The $\chi^2$-Distribution .............................. 559

Table 4. The t-Distribution ..................................... 560

List of References .............................................. 561

Index ........................................................... 571












FIRST PART 


MATHEMATICAL INTRODUCTION 




Chapters 1–3. Sets of Points.


CHAPTER 1. 


General Properties of Sets. 

1.1. Sets. – In pure and applied mathematics, situations often
occur where we have to consider the collection of all possible objects 
having certain specified properties. Any collection of objects defined 
in this way will be called a set, and each object belonging to such a 
set will be called an element of the set. 

The elements of a set may be objects of any kind: points, numbers,
functions, things, persons etc. Thus we may consider e. g. 1) the
set of all positive integral numbers, 2) the set of all points on a 
given straight line, 3) the set of all rational functions of two variables, 
4) the set of all persons born in a given country and alive at the 
end of the year 1940. In the first part of this book we shall mainly 
deal with cases where the elements are points or numbers, but in this 
introductory chapter we shall give some considerations which apply 
to the general case when the elements may be of any kind. 

In the example 4) given above, our set contains a finite, though 
possibly unknown, number of elements, whereas in the three first 
examples we obviously have to do with sets where the number of 
elements is not finite. We thus have to distinguish between finite
and infinite sets. 

An infinite set is called enumerable if its elements may be arranged
in a sequence: $x_1, x_2, \ldots, x_n, \ldots$ such that a) every $x_n$
is an element of the set, and b) every element of the set appears at a
definite place in the sequence. By such an arrangement we establish a
one-to-one correspondence between the elements of the given set and
those of the set containing all positive integral numbers
$1, 2, \ldots, n, \ldots$, which forms the simplest example of an
enumerable set.
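
As an illustration of conditions a) and b), the following Python sketch
arranges the set of all integers (positive, negative and zero) in a
single sequence 0, 1, -1, 2, -2, ... The example set and the names
`enumerate_integers` and `position_of` are assumptions made for this
illustration only; they do not occur in the text.

```python
from itertools import count, islice


def enumerate_integers():
    """Yield every integer exactly once, in the order 0, 1, -1, 2, -2, ..."""
    yield 0
    for n in count(1):
        yield n
        yield -n


def position_of(k):
    """Return the definite (1-based) place of the integer k in the sequence."""
    if k == 0:
        return 1
    return 2 * k if k > 0 else 2 * (-k) + 1


# The first few terms of the sequence.
first_terms = list(islice(enumerate_integers(), 7))
```

Condition a) holds because only integers are produced, and condition b)
holds because `position_of` names a place for every integer, which is
the one-to-one correspondence with the positive integral numbers.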

We shall see later that there exist also infinite sets which are
non-enumerable. If, from such a set, we choose any sequence of elements
$x_1, x_2, \ldots$, there will always be elements left in the set which
do not appear in the sequence, so that a non-enumerable set may be
said to represent a higher order of infinity than an enumerable set.
It will be shown later (cf. 4.3) that the set of all points on a given
straight line affords an example of a non-enumerable set.

1.2. Subsets, space. – If two sets $S$ and $S_1$ are such that every
element of $S_1$ also belongs to $S$, we shall call $S_1$ a subset of $S$,
and write

    $$S_1 \subset S \quad \text{or} \quad S \supset S_1.$$

We shall sometimes express this also by saying that $S_1$ is contained
in $S$ or belongs to $S$. – When $S_1$ consists of one single element $x$,
we use the same notation $x \subset S$ to express that $x$ belongs to $S$.

In the particular case when both the relations $S_1 \subset S$ and
$S \subset S_1$ hold, the sets are called equal, and we write

    $$S = S_1.$$

It is sometimes convenient to consider a set $S$ which does not
contain any element at all. This we call the empty set, and write
$S = 0$. The empty set is a subset of any set. If we regard the empty
set as a particular case of a finite set, it is seen that every subset of a
finite set is itself finite, while every subset of an enumerable set is finite
or enumerable. Thus the set of all integers between 20 and 30 is a
finite subset of the set $1, 2, 3, \ldots$, while the set of all odd integers
$1, 3, 5, \ldots$ is an enumerable subset of the same set.

In many investigations we shall be concerned with the properties
and the mutual relations of various subsets of a given set $\mathbf{S}$. The
set $\mathbf{S}$, which thus contains the totality of all elements that may
appear in the investigation, will then be called the space of the
investigation. If, e. g., we consider various sets of points on a given
straight line, we may choose as our space the set $\mathbf{S}$ of all points on
the line. Any subset $S$ of the space $\mathbf{S}$ will be called briefly a set
in $\mathbf{S}$.

1.3. Operations on sets. – Suppose now that a space $\mathbf{S}$ is given,
and let us consider various sets in $\mathbf{S}$. We shall first define the
operations of addition, multiplication and subtraction for sets.

The sum of two sets $S_1$ and $S_2$,

    $$S' = S_1 + S_2,$$

is the set $S'$ of all elements belonging to at least one of the sets
$S_1$ and $S_2$. – The product

    $$S'' = S_1 S_2$$

is the common part of the sets, or the set $S''$ of all elements belonging
to both $S_1$ and $S_2$. – Finally, the difference

    $$S''' = S_1 - S_2$$

will be defined only in the case when $S_2$ is a subset of $S_1$, and is
then the set $S'''$ of all elements belonging to $S_1$ but not to $S_2$.

Thus if $S_1$ and $S_2$ consist of all points inside the curves $C_1$ and
$C_2$ respectively (cf. Fig. 1), $S_1 + S_2$ will be the set of all points
inside at least one of the two curves, while $S_1 S_2$ will be the set of
all points common to both domains.



The product $S_1 S_2$ is evidently a subset of both $S_1$ and $S_2$. The
difference $S_n - S_1 S_2$, where $n$ may denote 1 or 2, is the set of all
points of $S_n$ which do not belong to $S_1 S_2$.

In the particular case when $S_1$ and $S_2$ have no common elements,
the product is empty, so that we have $S_1 S_2 = 0$. On the other hand,
if $S_1 = S_2$ the difference is empty, and we have $S_1 - S_2 = 0$.

In the particular case when $S_2$ is a subset of $S_1$ we have
$S_1 + S_2 = S_1$ and $S_1 S_2 = S_2$.

It follows from the symmetrical character of our definitions of the
sum and the product that the operations of addition and multiplication
are commutative, i. e. that we have

    $$S_1 + S_2 = S_2 + S_1 \quad \text{and} \quad S_1 S_2 = S_2 S_1.$$

Further, a moment's reflection will show that these operations are
also associative and distributive, like the corresponding arithmetic
operations. We thus have

    $$(S_1 + S_2) + S_3 = S_1 + (S_2 + S_3),$$
    $$(S_1 S_2) S_3 = S_1 (S_2 S_3),$$
    $$S_1 (S_2 + S_3) = S_1 S_2 + S_1 S_3.$$

It follows that we may without ambiguity talk of the sum or product 
of any finite number of sets: 

    $$S_1 + S_2 + \cdots + S_n \quad \text{and} \quad S_1 S_2 \cdots S_n,$$


where the order of terms and factors is arbitrary. 
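
The commutative, associative and distributive laws can be checked
mechanically on a small example. The sketch below is only an
illustration under assumed data: it takes a hypothetical four-element
space, lists all of its subsets, and tests the laws for every triple,
with Python's `|` playing the part of the sum and `&` the part of the
product.

```python
from itertools import product

# A small hypothetical space; any finite set would do.
space = frozenset(range(4))
elements = sorted(space)

# All 2**4 subsets of the space, built from bit masks.
subsets = [
    frozenset(x for i, x in enumerate(elements) if (mask >> i) & 1)
    for mask in range(2 ** len(elements))
]


def laws_hold(s1, s2, s3):
    """Check the commutative, associative and distributive laws for one triple."""
    commutative = (s1 | s2 == s2 | s1) and (s1 & s2 == s2 & s1)
    associative = ((s1 | s2) | s3 == s1 | (s2 | s3)) and (
        (s1 & s2) & s3 == s1 & (s2 & s3)
    )
    distributive = s1 & (s2 | s3) == (s1 & s2) | (s1 & s3)
    return commutative and associative and distributive


all_ok = all(laws_hold(a, b, c) for a, b, c in product(subsets, repeat=3))
```

Such a check is of course no proof; it merely confirms the laws on one
finite space.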

We may even extend the definition of these two operations to an
enumerable sequence of terms or factors. Thus, given a sequence
$S_1, S_2, \ldots$ of sets in $\mathbf{S}$, we define the sum

    $$\sum_{\nu=1}^{\infty} S_\nu = S_1 + S_2 + \cdots$$

as the set of all elements belonging to at least one of the sets $S_\nu$,
while the product

    $$\prod_{\nu=1}^{\infty} S_\nu = S_1 S_2 \cdots$$

is the set of all elements belonging to all $S_\nu$. – We then have, e. g.,

    $$S (S_1 + S_2 + \cdots) = S S_1 + S S_2 + \cdots$$

Thus if $S_\nu$ denotes the set of all real numbers $x$ such that
$\frac{1}{\nu + 1} \le x \le \frac{1}{\nu}$, we find that
$\sum_{1}^{\infty} S_\nu$ will be the set of all $x$ such that $0 < x \le 1$,
while the product set will be empty, $\prod_{1}^{\infty} S_\nu = 0$. – On the
other hand, if $S_\nu$ denotes the set of all $x$ such that
$0 \le x \le \frac{1}{\nu}$, the sum $\sum_{1}^{\infty} S_\nu$ will coincide
with $S_1$, while the product $\prod_{1}^{\infty} S_\nu$ will be a set
containing one single element, viz. the number $x = 0$.
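
The two examples can be explored numerically, with the caveat that a
program can only inspect finitely many sets and finitely many points.
The sketch below is therefore an approximation under stated
assumptions: it truncates both sequences at a hypothetical `N = 200`
and tests membership only on the grid of rational points 0, 1/60, ..., 1.

```python
from fractions import Fraction

N = 200  # truncation point; an assumption of this sketch


def in_first_family(x, v):
    """True if 1/(v+1) <= x <= 1/v."""
    return Fraction(1, v + 1) <= x <= Fraction(1, v)


def in_second_family(x, v):
    """True if 0 <= x <= 1/v."""
    return 0 <= x <= Fraction(1, v)


# Test points 0, 1/60, 2/60, ..., 1 as exact rationals.
sample = [Fraction(k, 60) for k in range(61)]
indices = range(1, N + 1)

# Points lying in at least one set, respectively in every set.
in_sum = [x for x in sample if any(in_first_family(x, v) for v in indices)]
in_product = [x for x in sample if all(in_first_family(x, v) for v in indices)]
in_product_second = [
    x for x in sample if all(in_second_family(x, v) for v in indices)
]
```

On this grid the truncated sum of the first family contains every point
except 0, its product contains no point, and the product of the second
family retains only the point 0, in agreement with the statements above.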

For the operation of subtraction, an important particular case
arises when $S_1$ coincides with the whole space $\mathbf{S}$. The difference

    $$S^* = \mathbf{S} - S$$

is the set of all elements of our space which do not belong to $S$, and
will be called the complementary set or simply the complement of $S$. We
obviously have $S + S^* = \mathbf{S}$, $S S^* = 0$, and $(S^*)^* = S$.

It is important to observe that the complement of a given set $S$
is relative to the space $\mathbf{S}$ in which $S$ is considered. If our space is
the set of all points on a given straight line $L$, and if $S$ is the set
of all points situated on the positive side of an origin $O$ on this line,
the complement $S^*$ will consist of $O$ itself and all points on the
negative side of $O$. If, on the other hand, our space consists of all
points in a certain plane $P$ containing $L$, the complement $S^*$ of the
same set $S$ will also include all points of $P$ not belonging to $L$. –
In all cases where there might be a risk of a mistake, we shall use
the expression: $S^*$ is the complement of $S$ with respect to $\mathbf{S}$.

The operations of addition and multiplication may be brought into
relation with one another by means of the concept of complementary
sets. We have, in fact, for any finite or enumerable sequence
$S_1, S_2, \ldots$ the relations

    (1.3.1)    $$(S_1 + S_2 + \cdots)^* = S_1^* S_2^* \cdots,$$
               $$(S_1 S_2 \cdots)^* = S_1^* + S_2^* + \cdots.$$

The first relation expresses that the complementary set of a sum is the
product of the complements of the terms. This is a direct consequence
of the definitions. As a matter of fact, the complement
$(S_1 + S_2 + \cdots)^*$ is the set of all elements $x$ of the space, of
which it is not true that they occur in at least one $S_\nu$. This is,
however, the same thing as the set of all elements $x$ which are absent
from every $S_\nu$, or the set of all $x$ which belong to every complement
$S_\nu^*$, i. e. the product $S_1^* S_2^* \cdots$. The second relation is
obtained from the first by substituting $S_\nu^*$ for $S_\nu$. – For the
operation of subtraction, we obtain by a similar argument the relation

    (1.3.2)    $$S_1 - S_2 = S_1 S_2^*.$$

The reader will find that the understanding of relations such as 
(1.3.1) and (1.3.2) is materially simplified by the use of figures of the 
same type as Fig. 1. 
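
Besides figures, a finite example may help. The following sketch
checks both relations of (1.3.1) and the relation (1.3.2) on a
hypothetical ten-element space; the particular sets and the helper
name `complement` are assumptions chosen only for illustration.

```python
# A hypothetical finite space and a short sequence of sets in it.
space = frozenset(range(10))
family = [frozenset({0, 1, 2}), frozenset({2, 3, 4}), frozenset({2, 4, 6, 8})]


def complement(s):
    """Complement of s with respect to the chosen space."""
    return space - s


total = frozenset().union(*family)        # the sum of the sets
common = space.intersection(*family)      # the product of the sets
complements = [complement(s) for s in family]

# (1.3.1), first relation: complement of a sum = product of complements.
first_relation = complement(total) == space.intersection(*complements)

# (1.3.1), second relation: complement of a product = sum of complements.
second_relation = complement(common) == frozenset().union(*complements)

# (1.3.2): for a subset s2 of s1, the difference equals s1 times s2*.
s1, s2 = frozenset({0, 1, 2, 3}), frozenset({1, 2})
third_relation = (s1 - s2) == (s1 & complement(s2))
```

Changing `space` changes every complement, which is the dependence on
the space stressed in the text.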


1.4. Sequences of sets. – When we use the word sequence without
further specification, it will be understood that we mean a finite or
enumerable sequence. A sequence $S_1, S_2, \ldots, S_n, \ldots$ will
often be briefly called the sequence $\{S_\nu\}$.

When we are concerned with the sum of a sequence of sets 

    $$S = S_1 + S_2 + \cdots,$$

it is sometimes useful to be able to represent S as the sum of a 
sequence of sets such that no two have a common element. 

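This may be effected by the following transformation. Let us put

    $$Z_1 = S_1,$$
    $$Z_2 = S_1^* S_2,$$
    $$\cdots$$
    $$Z_\nu = S_1^* S_2^* \cdots S_{\nu-1}^* S_\nu,$$
    $$\cdots$$

Thus $Z_\nu$ is the set of all elements of $S_\nu$ not contained in any of
the preceding sets $S_1, \ldots, S_{\nu-1}$. It is then easily seen that
$Z_\mu$ and $Z_\nu$ have no common element, as soon as $\mu \ne \nu$.
Suppose e. g. $\mu < \nu$; then $Z_\mu$ is a subset of $S_\mu$, while
$Z_\nu$ is a subset of $S_\mu^*$, so that $Z_\mu Z_\nu = 0$.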

Let us now put $S' = Z_1 + Z_2 + \cdots$. Since $Z_\nu \subset S_\nu$ for
all $\nu$, we have $S' \subset S$. On the other hand, let $x$ denote any
element of $S$. By definition, $x$ belongs to at least one of the $S_\nu$.
Let $S_n$ be the first set of the sequence $S_1, S_2, \ldots$ that contains
$x$ as an element. Then the definition of $Z_n$ shows that $x$ belongs to
$Z_n$ and consequently also to $S'$. Thus we have both $S \subset S'$ and
$S' \subset S$, so that $S' = S$ and

    $$S = Z_1 + Z_2 + \cdots.$$
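
The transformation is easy to carry out on a finite sequence of finite
sets. The sketch below is a minimal illustration: the function name
`disjoint_parts` and the sample sets are assumptions, and the running
set `seen` plays the part of the sum of all preceding sets.

```python
def disjoint_parts(sets):
    """Return the parts Z_1, Z_2, ...: each part holds the elements of the
    corresponding set that are absent from all preceding sets."""
    seen = set()
    parts = []
    for s in sets:
        parts.append(set(s) - seen)  # drop what earlier sets already cover
        seen |= set(s)
    return parts


sets = [{1, 2, 3}, {2, 3, 4}, {1, 5}, {4, 5}]
parts = disjoint_parts(sets)

# The parts have the same sum as the original sets ...
same_sum = set().union(*parts) == set().union(*sets)

# ... and no two of them have a common element.
pairwise_disjoint = all(
    not (parts[i] & parts[j])
    for i in range(len(parts))
    for j in range(i + 1, len(parts))
)
```

Note that a part may turn out empty, as the last one does here, which
is why the parts are in general only finite or enumerable.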

We shall use this transformation to show that the sum of a sequence
of enumerable sets is itself enumerable. If $S_\nu$ is enumerable, then
$Z_\nu$ as a subset of $S_\nu$ must be finite or enumerable. Let the
elements of $Z_\nu$ be $x_{\nu 1}, x_{\nu 2}, \ldots$. Then the elements of
$S = \sum S_\nu = \sum Z_\nu$ form the double sequence

    $$x_{11} \quad x_{12} \quad x_{13} \quad \cdots$$
    $$x_{21} \quad x_{22} \quad x_{23} \quad \cdots$$
    $$x_{31} \quad x_{32} \quad x_{33} \quad \cdots$$
    $$\cdots$$

and these may be arranged in a simple sequence e. g. by reading along
diagonals: $x_{11}, x_{12}, x_{21}, x_{13}, x_{22}, x_{31}, \ldots$. It is
readily seen that every element of $S$ appears at a definite place in the
sequence, and thus $S$ is enumerable.
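
The reading along diagonals can be written out as a short generator.
This sketch assumes, for simplicity, that every row of the double
sequence is infinite (for a finite row, the missing entries would
simply be skipped); the names `diagonal_indices` and `place` are
hypothetical.

```python
from itertools import count, islice


def diagonal_indices():
    """Yield the index pairs (v, k), v, k >= 1, diagonal by diagonal:
    (1,1), (1,2), (2,1), (1,3), (2,2), (3,1), ..."""
    for total in count(2):           # total = v + k is constant on a diagonal
        for v in range(1, total):
            yield (v, total - v)


def place(v, k):
    """Return the definite (1-based) place of the pair (v, k) in the sequence."""
    total = v + k
    earlier = (total - 2) * (total - 1) // 2   # pairs on preceding diagonals
    return earlier + v


first_six = list(islice(diagonal_indices(), 6))
```

The closed formula in `place` makes explicit that every element of the
double sequence appears at a definite place in the simple sequence.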

1.5. Monotone sequences. – A sequence $S_1, S_2, \ldots$ is never
decreasing, if we have $S_n \subset S_{n+1}$ for all $n$. If, on the
contrary, we have $S_n \supset S_{n+1}$ for all $n$, the sequence is never
increasing. With a common name, both types of sequences are called
monotone.

For a never decreasing infinite sequence, we have

    $$S_n = \sum_{\nu=1}^{n} S_\nu,$$

and this makes it natural to define the limit of such a sequence by
writing

    $$\lim_{n \to \infty} S_n = \sum_{\nu=1}^{\infty} S_\nu.$$

Similarly, we have for a never increasing sequence

    $$S_n = \prod_{\nu=1}^{n} S_\nu,$$

and accordingly we define in this case

    $$\lim_{n \to \infty} S_n = \prod_{\nu=1}^{\infty} S_\nu.$$

Thus if $S_n$ denotes the set of all points $(x, y, z)$ inside the sphere
$x^2 + y^2 + z^2 = 1 - \frac{1}{n}$, the sequence $S_1, S_2, \ldots$ will be
never decreasing, and $\lim S_n$ will be the set of all points inside the
sphere $x^2 + y^2 + z^2 = 1$. On the other hand, if $S_n$ denotes the set
of all points inside the sphere $x^2 + y^2 + z^2 = 1 + \frac{1}{n}$, the
sequence will be never increasing, and $\lim S_n$ will consist of all
points belonging to the inside or the surface of the sphere
$x^2 + y^2 + z^2 = 1$.

It is possible to extend the definition of a limit also to certain types of sequences
that are not monotone. We shall, however, have no occasion to use such an extension
in this book.




1.6. Additive classes of sets. – Given a space $\mathbf{S}$, we may consider
various classes of sets in $\mathbf{S}$. We shall make an important use of the
concept of an additive class of sets in $\mathbf{S}$. A class $\mathfrak{C}$ of sets
in $\mathbf{S}$ will be called additive ¹), if it satisfies the following three
conditions:

a) The whole space $\mathbf{S}$ belongs to $\mathfrak{C}$.

b) If every set of the sequence $S_1, S_2, \ldots$ belongs to $\mathfrak{C}$, then
the sum $S_1 + S_2 + \cdots$ and the product $S_1 S_2 \cdots$ both belong to
$\mathfrak{C}$.

c) If $S_1$ and $S_2$ belong to $\mathfrak{C}$, and $S_2 \subset S_1$, then the
difference $S_1 - S_2$ belongs to $\mathfrak{C}$.

If $\mathfrak{C}$ is an additive class, we can thus perform the operations of
addition, multiplication and subtraction any finite or enumerable
number of times on members of $\mathfrak{C}$ without ever encountering a set
that is not a member of $\mathfrak{C}$.

It may be remarked that the three above conditions are evidently
not independent of one another. As a matter of fact, the relations
(1.3.1) and (1.3.2) show that the following is an entirely equivalent
form of the conditions:

$a_1$) The whole space $\mathbf{S}$ belongs to $\mathfrak{C}$.

$b_1$) If every set of the sequence $S_1, S_2, \ldots$ belongs to
$\mathfrak{C}$, then the sum $S_1 + S_2 + \cdots$ belongs to $\mathfrak{C}$.

$c_1$) If $S$ belongs to $\mathfrak{C}$, then the complementary set $S^*$
belongs to $\mathfrak{C}$.

The name »additive class» is due to the important place which, in
this form of the conditions, is occupied by the additivity condition $b_1$).

The class of all possible subsets of S is an obvious example of 
an additive class. In the following chapter we shall, however, meet 
with a more interesting case. 
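
For a finite space the second form of the conditions can be tested
directly. The sketch below is an illustration under assumptions: the
function name `is_additive` and the two sample classes are
hypothetical, and since a finite class contains only finitely many
distinct sets, closure under sums of two sets is taken as a stand-in
for closure under sums of whole sequences.

```python
from itertools import combinations


def is_additive(space, cls):
    """Check the whole-space, sum and complement conditions for a class
    of subsets of a finite space."""
    space = frozenset(space)
    cls = {frozenset(s) for s in cls}
    has_space = space in cls
    closed_under_sum = all((a | b) in cls for a, b in combinations(cls, 2))
    closed_under_complement = all((space - s) in cls for s in cls)
    return has_space and closed_under_sum and closed_under_complement


space = {1, 2, 3, 4}

# Closed under sums and complements: an additive class.
good = [set(), {1, 2}, {3, 4}, {1, 2, 3, 4}]

# The complement of {1} is missing, so the class is not additive.
bad = [set(), {1}, {1, 2, 3, 4}]

good_is_additive = is_additive(space, good)
bad_is_additive = is_additive(space, bad)
```

The class `good` is smaller than the class of all subsets of the space,
so additive classes other than the obvious one do exist even here.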


CHAPTER 2. 

Linear Point Sets. 

2.1. Intervals. – Let our space be the set $R_1$ of all points on a
given straight line. Any set in $R_1$ will be called a linear point set.

¹) In this book, we shall always use the word »additive» in the same sense as in
this paragraph, i. e. with reference to a finite or enumerable sequence of terms. It
may be remarked that some authors use in this sense the expression »completely
additive», while »additive» or »simply additive» is used to denote a property
essentially restricted to a finite number of terms.



If we choose on our line an origin $O$, a unit of measurement and a
positive direction, it is well known that we can establish a one-to-one 
correspondence between all real numbers and all points on the line. 
Thus we may talk without distinction of a point x on the line or 
the real number x that corresponds to the point. We consider only 
points corresponding to finite numbers; thus infinity does not count 
as a point. 

A simple case of a linear point set is an interval. If $a$ and $b$ are
any points such that $a \le b$, we shall use the following expressions to
denote the set of all $x$ such that:

$a \le x \le b$ . . . the closed interval $(a, b)$;

$a < x < b$ . . . the open interval $(a, b)$;

$a < x \le b$ . . . the half-open interval $(a, b)$, closed on the right;

$a \le x < b$ . . . the half-open interval $(a, b)$, closed on the left.

When we talk simply of an interval $(a, b)$ without further specification
in the context, it will be understood that anything that we say shall
be true for all four kinds of intervals.

In the limiting case when $a = b$, we shall say that the interval is
degenerate. In this case, the closed interval reduces to a set
containing the single point $x = a$, while each of the other three intervals
is empty.

If, in the above inequalities, we allow $b$ to tend to $+\infty$, we
obtain the inequalities defining the closed and the open infinite
interval $(a, +\infty)$ respectively:

$$x \ge a \quad \text{and} \quad x > a.$$

Similarly when $a$ tends to $-\infty$ we obtain

$$x \le b \quad \text{and} \quad x < b$$

for the closed and the open infinite interval $(-\infty, b)$. Finally, the
whole space $R_1$ may be considered as the infinite interval $(-\infty, +\infty)$.

It will be shown below (cf 4.3) that any non-degenerate interval is a
non-enumerable set.

The product of a finite or enumerable sequence of intervals is
always an interval, but the sum of two intervals is generally not an
interval. In order to give an example of a case when a sum of intervals
is another interval, we consider $n + 1$ points $a < x_1 < x_2 < \cdots < x_{n-1} < b$.
If all intervals appearing in the following relation are half-open and
closed on the same side, we obviously have

$$(a, b) = (a, x_1) + (x_1, x_2) + \cdots + (x_{n-1}, b),$$

and no two terms in the second member have a common point. The
same relation holds if all intervals are closed, but in this case any
two consecutive terms have precisely one common point. If all
intervals are open, on the other hand, the relation is not true.
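
The three cases of the decomposition of $(a, b)$ by intermediate points can be checked on a small numerical example. The following is only an illustrative sketch: the function names `contains` and `count_parts` and the chosen points are assumptions made here, not notation from the text. For a given point it counts how many of the consecutive parts contain it.

```python
from fractions import Fraction

def contains(interval, x, kind):
    """Membership test for the interval (lo, hi) of the given kind."""
    lo, hi = interval
    if kind == "closed":
        return lo <= x <= hi
    if kind == "open":
        return lo < x < hi
    if kind == "left":  # half-open, closed on the left
        return lo <= x < hi
    raise ValueError(kind)

def count_parts(cuts, x, kind):
    """Number of consecutive parts (cuts[k], cuts[k+1]) that contain x."""
    parts = list(zip(cuts, cuts[1:]))
    return sum(contains(p, x, kind) for p in parts)

# Hypothetical points a < x_1 < x_2 < b.
cuts = [Fraction(0), Fraction(1, 3), Fraction(1, 2), Fraction(1)]
at_cut = {kind: count_parts(cuts, Fraction(1, 3), kind)
          for kind in ("left", "closed", "open")}
```

At the dividing point 1/3 the half-open parts cover the point exactly once, the closed parts twice, and the open parts not at all, which is why the relation fails for open intervals.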

2.2. Various properties of sets in $R_1$. Consider a non-empty
set $S$. When a point $\alpha$ exists such that, for any $\varepsilon > 0$, there is at least
one point of $S$ in the closed interval $(\alpha, \alpha + \varepsilon)$, while there is none
in the open interval $(-\infty, \alpha)$, we shall call $\alpha$ the lower bound of $S$.
When no finite $\alpha$ with this property exists, we shall say that the
lower bound of $S$ is $-\infty$. In a similar way we define the upper bound
$\beta$ of $S$. A set is bounded, when its lower and upper bounds are both
finite. A bounded set $S$ is a subset of the closed interval $(\alpha, \beta)$. The
points $\alpha$ and $\beta$ themselves may or may not belong to $S$.

If $\varepsilon$ is any positive number, the open interval $(x - \varepsilon, x + \varepsilon)$ will
be called a neighbourhood of the point $x$ or, more precisely, the
$\varepsilon$-neighbourhood of $x$.

A point $z$ is called a limiting point of the set $S$ if every
neighbourhood of $z$ contains at least one point of $S$ different from $z$. If
this condition is satisfied, it is readily seen that every neighbourhood
of $z$ even contains an infinity of points of $S$. The point $z$ itself may
or may not belong to $S$. The Bolzano-Weierstrass theorem asserts that
every bounded infinite set has at least one limiting point. We assume
this to be already known. If $z$ is a limiting point, the set $S$
always contains a sequence of points $x_1, x_2, \ldots$ such that $x_n \to z$
as $n \to \infty$.

A point $x$ of $S$ is called an inner point of $S$ if we can find $\varepsilon$ such
that the whole $\varepsilon$-neighbourhood of $x$ is contained in $S$. Obviously an
inner point is always a limiting point.

We shall now give some examples of the concepts introduced above. In the
first place, let $S$ be a finite non-degenerate interval $(a, b)$. Then $a$ is the lower bound
and $b$ is the upper bound of $S$. Every point belonging to the closed interval $(a, b)$
is a limiting point of $S$, while every point belonging to the open interval $(a, b)$ is an
inner point of $S$.




Consider now the set $R$ of all rational points $x = p/q$ belonging to the half-open
interval $0 < x \le 1$. If we write the sequence

$$\tfrac{1}{1},\ \tfrac{1}{2},\ \tfrac{2}{2},\ \tfrac{1}{3},\ \tfrac{2}{3},\ \tfrac{3}{3},\ \ldots$$

and then discard all numbers $p/q$ such that $p$ and $q$ have a common factor, every
point of $R$ will occur at precisely one place in the sequence, and hence $R$ is
enumerable. There are no inner points of $R$. Every point of the closed interval
$(0, 1)$ is a limiting point. The complement $R^*$ of $R$ with respect to the half-open
interval $0 < x \le 1$ is the set of all irrational points contained in that interval. $R^*$
is not an enumerable set, as in that case the interval $(0, 1)$ would be the sum of two
enumerable sets and thus itself enumerable. Like $R$ itself, $R^*$ has no inner points,
and every point of the closed interval $(0, 1)$ is a limiting point.
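
The enumeration of the rational points of $0 < x \le 1$ described above can be sketched as a generator. This is a minimal illustration; the name `rationals_in_unit_interval` is an assumption introduced here. Fractions $p/q$ are visited in order of increasing denominator, and those in which $p$ and $q$ have a common factor are discarded.

```python
from fractions import Fraction
from itertools import count, islice
from math import gcd

def rationals_in_unit_interval():
    """Yield every rational p/q with 0 < p/q <= 1 exactly once."""
    for q in count(1):              # denominators 1, 2, 3, ...
        for p in range(1, q + 1):   # numerators 1, ..., q
            if gcd(p, q) == 1:      # discard p/q with a common factor
                yield Fraction(p, q)

first = list(islice(rationals_in_unit_interval(), 8))
many = list(islice(rationals_in_unit_interval(), 200))
```

Every point of the set receives a definite place in the sequence, which is exactly what the enumerability of the set means.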

Since $R$ is enumerable, it immediately follows that the set $R_n$ of all rational
points $x$ belonging to the half-open interval $n < x \le n + 1$ is, for every positive or
negative integer $n$, an enumerable set. From a proposition proved in 1.4 it then
follows that the set of all positive and negative rational numbers is enumerable. The
latter set is, in fact, the sum of the sequence $\{R_n\}$, where $n$ assumes all positive
and negative integral values, and is thus by 1.4 an enumerable set.

2.3. Borel sets. Consider the class of all intervals in $R_1$:
closed, open and half-open, degenerate and non-degenerate, finite and
infinite, including in particular the whole space $R_1$ itself. Obviously
this is not an additive class of sets as defined in 1.6, since the sum
of two intervals is generally not an interval. Let us try to build up
an additive class by associating further sets to the intervals.

As a first generalization we consider the class $\mathfrak{I}$ of all point sets
$I$ such that $I$ is the sum of a finite or enumerable sequence of intervals.
If $I_1, I_2, \ldots$ are sets belonging to the class $\mathfrak{I}$, the sum $I_1 + I_2 + \cdots$
is, by 1.4, also the sum of a finite or enumerable sequence of
intervals, and thus belongs to $\mathfrak{I}$. The same thing holds for any finite
product $I_1 I_2 \ldots I_n$, on account of the extension of the distributive
property indicated in 1.3. We shall, however, show by examples that
neither the infinite product $I_1 I_2 \ldots$ nor the difference $I_1 - I_2$
necessarily belongs to $\mathfrak{I}$. In fact, the set $R$ considered in the preceding
paragraph belongs to $\mathfrak{I}$, since it is the sum of an enumerable sequence
of degenerate intervals, each containing one single point $p/q$. The
difference $(0, 1) - R$, on the other hand, does not contain any
non-degenerate interval, and if we try to represent it as a sum of degenerate
intervals, a non-enumerable set of such intervals will be required.
Thus the difference does not belong to the class $\mathfrak{I}$. Further, this
difference set may also be represented as a product $I_1 I_2 \ldots$, where
$I_n$ denotes the difference between the interval $(0, 1)$ and the set
containing only the $n$-th point of the set $R$. Thus this product of sets
in $\mathfrak{I}$ does not itself belong to the class $\mathfrak{I}$.

Though we shall make in Ch. 4 an important use of the class $\mathfrak{I}$,
it is thus clear that for our present purpose this class is not sufficient.
In order to build up an additive class, we must associate with $\mathfrak{I}$
further sets of a more general character.

If we associate with $\mathfrak{I}$ all sums and products of sequences of sets
in $\mathfrak{I}$, and all differences between two sets in $\mathfrak{I}$ such that the difference
is defined (some of which sets are, of course, already included in
$\mathfrak{I}$), we obtain an extended class of sets. It can, however, be shown
that not even this extended class will satisfy all the conditions for
an additive class. We thus have to repeat the same process of
association over and over again, without ever coming to an end. Any
particular set reached during this process has the property that it
can be defined by starting from intervals and performing the
operations of addition, multiplication and subtraction a finite or enumerable
number of times. The totality of all sets ever reached in this way is
called the class $\mathfrak{B}_1$ of Borel sets in $R_1$, and this is an additive class. As
a matter of fact, every given Borel set can be formed as described
by at most an enumerable number of steps, and any sum, product or
difference formed with such sets will still be contained in the class
of all sets obtainable in this way.

Thus any sum, product or difference of Borel sets is itself a 
Borel set. In particular, the limit of a monotone sequence (cf 1.5) of 
Borel sets is always a Borel set. 

On the other hand, let $\mathfrak{C}$ be any additive class of sets in $R_1$
containing all intervals. It then follows directly from the definition of
an additive class that $\mathfrak{C}$ must contain every set that can be obtained
from intervals by any finite or enumerable repetition of the operations
of addition, multiplication and subtraction. Thus $\mathfrak{C}$ must contain the
whole class $\mathfrak{B}_1$ of Borel sets, and we may say that the class $\mathfrak{B}_1$ is the
smallest additive class of sets in $R_1$ that includes all intervals.




CHAPTER 3. 

Point Sets in n Dimensions. 

3.1. Intervals. Just as we may establish a one-to-one
correspondence between all real numbers $x$ and all points on a straight line,
it is well known that a similar correspondence may be established
between all pairs of real numbers $(x_1, x_2)$ and all points in a plane, or
between all triplets of real numbers $(x_1, x_2, x_3)$ and all points in a
three-dimensional space.

Generalizing, we may regard any system of $n$ real numbers
$(x_1, x_2, \ldots, x_n)$ as representing a point or vector $x$ in a euclidean space
$R_n$ of $n$ dimensions. The numbers $x_1, \ldots, x_n$ are called the coordinates
of $x$. As in the one-dimensional case, we consider only points
corresponding to finite values of the coordinates. The distance
between two points

$$x = (x_1, \ldots, x_n) \quad \text{and} \quad y = (y_1, \ldots, y_n)$$

is the non-negative quantity

$$|x - y| = \sqrt{(x_1 - y_1)^2 + \cdots + (x_n - y_n)^2}.$$

The distance satisfies the triangular inequality:

$$|x - y| \le |x - z| + |y - z|.$$
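
As a numerical illustration of the distance in $R_n$ and of the triangular inequality $|x - y| \le |x - z| + |y - z|$, the following sketch evaluates both sides for three points in $R_3$. The function name `distance` and the sample points are assumptions chosen here for illustration only.

```python
import math

def distance(x, y):
    """Euclidean distance |x - y| between two points of R_n given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Three hypothetical points in R_3.
x, y, z = (0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (1.0, 1.0, 1.0)
direct = distance(x, y)
via_z = distance(x, z) + distance(y, z)
holds = direct <= via_z
```

A single numerical instance is of course no proof; it only shows how the two members of the inequality are formed.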

Let $2n$ numbers $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$ be given, such that
$a_\nu \le b_\nu$ for $\nu = 1, \ldots, n$. The set of all points $x$ defined by $a_\nu \le x_\nu \le b_\nu$
for $\nu = 1, \ldots, n$ is called a closed $n$-dimensional interval. If all the signs
$\le$ are replaced by $<$, we obtain an open interval, and if both kinds
of signs occur in the defining inequalities, we have a half-open
interval. In the limiting case when $a_\nu = b_\nu$ for at least one value of $\nu$,
the interval is degenerate. When one or more of the $a_\nu$ tend to $-\infty$,
or one or more of the $b_\nu$ to $+\infty$, we obtain an infinite interval. As
in 2.1, the whole space $R_n$ may be considered as an extreme case of
an infinite interval.

It will be shown below (cf 4.3) that any non-degenerate interval
is a non-enumerable set. The product of a finite or enumerable sequence
of intervals is always an interval, but the sum of two intervals is
generally not an interval.




3.2. Various properties of sets in $R_n$. A set $S$ in $R_n$ is bounded,
if all points of $S$ are contained in a finite interval.

If $a = (a_1, \ldots, a_n)$ is a given point, and $\varepsilon$ is a positive number,
the set of all points $x$ such that $|x - a| < \varepsilon$ is called a neighbourhood
of $a$ or, more precisely, the $\varepsilon$-neighbourhood of $a$.

The definitions of the concepts of limiting point and inner point,
and the remarks made in 2.2 in connection with these concepts for
the case $n = 1$, apply without modification to the general case here
considered.

We have seen in 2.2 that the set of all rational points in $R_1$ is enumerable. By
means of 1.4 it then follows that the set of all points with rational coordinates in a
plane is enumerable, and further by induction that the set of all points in $R_n$ with
rational coordinates is enumerable.


3.3. Borel sets. The class of all intervals in $R_n$ is, like the
corresponding class in $R_1$, not an additive class of sets. In order to
extend this class so as to form an additive class we proceed in the
same way as in the case of intervals in $R_1$.

Thus we consider first the class $\mathfrak{I}_n$ of all sets $I$ that are sums of
finite or enumerable sequences of intervals in $R_n$. If $I_1, I_2, \ldots$ are
sets belonging to this class, the sum $I_1 + I_2 + \cdots$ and the finite
product $I_1 I_2 \ldots I_n$ also belong to $\mathfrak{I}_n$. As in the case $n = 1$, the infinite
product $I_1 I_2 \ldots$ and the difference $I_1 - I_2$ do not, however, always
belong to $\mathfrak{I}_n$.

We thus extend the class $\mathfrak{I}_n$ by associating all sums, products and
differences formed by means of sets in $\mathfrak{I}_n$. Repeating the same
association process over and over again, we find that any particular set
reached in this way has the property that it can be defined by
starting from intervals and performing the operations of addition,
multiplication and subtraction a finite or enumerable number of times.
The totality of all sets ever reached in this way is called the class $\mathfrak{B}_n$
of Borel sets in $R_n$, and this is an additive class.

In the same way as in the case $n = 1$, we find that the class $\mathfrak{B}_n$
is the smallest additive class of sets in $R_n$ that includes all intervals.

3.4. Linear sets. When $n > 3$, the set of all points in $R_n$ which
satisfy a single equation $F(x_1, \ldots, x_n) = 0$ will be called a hypersurface.
When $F$ is a linear function, the hypersurface becomes a hyperplane.
The equation of a hyperplane may always be written in the form

$$a_1 (x_1 - m_1) + \cdots + a_n (x_n - m_n) = 0,$$

where $m = (m_1, \ldots, m_n)$ is an arbitrary point of the hyperplane.
Let

$$H_i = a_{i1} (x_1 - m_1) + \cdots + a_{in} (x_n - m_n) = 0, \tag{3.4.1}$$

where $i = 1, 2, \ldots, p$, be the equations of $p$ hyperplanes passing
through the same point $m$. The equations (3.4.1) will be called linearly
independent, if there is no linear combination $k_1 H_1 + \cdots + k_p H_p$ with
constant $k_i$ not all $= 0$, which reduces identically to zero. The
corresponding hyperplanes are then also said to be linearly independent.

Suppose $p < n$, and consider the set $L$ of all points in $R_n$ common
to the $p$ linearly independent hyperplanes (3.4.1). If (3.4.1) is
considered as a system of linear equations with the unknowns $x_1, \ldots, x_n$,
the general solution (cf 11.8) is

$$x_i = m_i + c_{i1} t_1 + \cdots + c_{i,n-p}\, t_{n-p},$$

where the $c_{ik}$ are constants depending on the coefficients $a_{ik}$, while
$t_1, \ldots, t_{n-p}$ are arbitrary parameters.

The coordinates of a point of the set $L$ may thus be expressed
as linear functions of $n - p$ arbitrary parameters. Accordingly the
set $L$ will be called a linear set of $n - p$ dimensions, and will usually
be denoted by $L_{n-p}$. For $p = 1$, this is a hyperplane, while for $p = n - 2$
$L$ forms an ordinary plane, and for $p = n - 1$ a straight line.
Conversely, if $L_{n-p}$ is a linear set of $n - p$ dimensions, and if
$m = (m_1, \ldots, m_n)$ is an arbitrary point of $L_{n-p}$, then $L_{n-p}$ may be
represented as the common part (i.e. the product set) of $p$ linearly
independent hyperplanes passing through $m$.

3.5. Subspace, product space. Consider the space $R_n$ of all
points $x = (x_1, \ldots, x_n)$. Let us select a group of $k < n$ coordinates,
say $x_1, \ldots, x_k$, and put all the remaining $n - k$ coordinates equal to
zero: $x_{k+1} = \cdots = x_n = 0$. We thus obtain a system of $n - k$ linearly
independent relations, which define a linear set $L_k$ of $k$ dimensions.
This will be called the $k$-dimensional subspace corresponding to the
coordinates $x_1, \ldots, x_k$. The subspace corresponding to any other group
of $k$ coordinates is, of course, defined in a similar way. Thus in the
case $n = 3$, $k = 2$, the two-dimensional subspace corresponding to
$x_1$ and $x_2$ is simply the $(x_1, x_2)$-plane.

Let $S$ denote a set in the $k$-dimensional subspace of $x_1, \ldots, x_k$.
The set of all points $x$ in $R_n$ such that $(x_1, \ldots, x_k, 0, \ldots, 0) \subset S$ will
be called a cylinder set with the base $S$. In the case $n = 3$, $k = 2$,
this is an ordinary three-dimensional cylinder in the $(x_1, x_2, x_3)$-space,
having the set $S$ in the $(x_1, x_2)$-plane as its base.

Further, if $S_1$ and $S_2$ are sets in the subspaces of $x_1, \ldots, x_k$ and
$x_{k+1}, \ldots, x_n$ respectively, the set of all points $x$ in $R_n$ such that
$(x_1, \ldots, x_k, 0, \ldots, 0) \subset S_1$ and $(0, \ldots, 0, x_{k+1}, \ldots, x_n) \subset S_2$ will be called
a rectangle set with the sides $S_1$ and $S_2$. In the case when $n = 2$,
while $S_1$ and $S_2$ are one-dimensional intervals, this is an ordinary
rectangle in the $(x_1, x_2)$-plane.

Finally, let $R_m$ and $R_n$ be spaces of $m$ and $n$ dimensions respectively.
Consider the set of all pairs of points $(x, y)$ where $x = (x_1, \ldots, x_m)$
is a point in $R_m$, while $y = (y_1, \ldots, y_n)$ is a point in $R_n$. This set
will be called the product space of $R_m$ and $R_n$. It is a space of $m + n$
dimensions, with all points $(x_1, \ldots, x_m, y_1, \ldots, y_n)$ as its elements.
Thus for $m = n = 1$, we find that the $(x_1, x_2)$-plane may be regarded
as the product of the one-dimensional $x_1$- and $x_2$-spaces. For $m = 2$
and $n = 1$, we obtain the $(x_1, x_2, x_3)$-space as the product of the
$(x_1, x_2)$-plane and the one-dimensional $x_3$-space, etc. The extension of
the above definition to product spaces of more than two spaces is
obvious. (Note that the product space introduced here is something
quite different from the product set defined in 1.3.)

References to chapters 1-3. The theory of sets of points was founded
by G. Cantor about 1880. It is of a fundamental importance for many branches of
mathematics, such as the modern theory of integration and the theory of functions.
Most treatises on these subjects contain chapters on sets of points. The reader may
be referred e.g. to the books by Borel (Ref. 6) and de la Vallée Poussin (Ref. 40).





Chapters 4-7. 


Theory of Measure and Integration in $R_1$.


CHAPTER 4. 

The Lebesgue Measure of a Linear Point Set. 

4.1. Length of an interval. The length of a finite interval $(a, b)$
in $R_1$ is the non-negative quantity $b - a$. Thus the length has the same
value for a closed, an open and a half-open interval with the same
end-points. For a degenerate interval, the length is zero. The length
of an infinite interval we define as $+\infty$.

Thus with every interval $i = (a, b)$ we associate a definite
non-negative length, which may be finite or infinite. We may express this
by saying that the length $L(i)$ is a non-negative function of the interval
and writing

$$L(i) = b - a, \quad \text{or} \quad L(i) = +\infty,$$

according as the interval $i$ is finite or infinite.

If an interval $i$ is the sum (cf 2.1) of a finite number of intervals,
no two of which have a common point:

$$i = i_1 + i_2 + \cdots + i_n \qquad (i_\mu i_\nu = 0 \ \text{for} \ \mu \ne \nu),$$

the length of the total interval $i$ is obviously equal to the sum of the
lengths of the parts:

$$L(i) = L(i_1) + L(i_2) + \cdots + L(i_n).$$

We now propose to show that this relation may be extended to an
enumerable sequence of parts. To a reader who studies the subject for the
first time, this will no doubt seem trivial. A careful study of the
following proof may perhaps convince him that it is not. In order
to give a rigorous proof of our statement, we shall require the
following important proposition known as Borel's lemma:

We are given a finite closed interval $(a, b)$ and a set $Z$ of intervals such
that every point of $(a, b)$ is an inner point of at least one interval
belonging to $Z$. Then there is a subset $Z'$ of $Z$ containing only a finite
number of intervals, such that every point of $(a, b)$ is an inner point of
at least one interval belonging to $Z'$.

Divide the interval $(a, b)$ into $n$ parts of equal length. The lemma
will be proved, if we can show that it is possible so to choose $n$ that
each of the $n$ parts, considered as a closed interval, is entirely
contained in an interval belonging to $Z$.

Suppose, in fact, that this is not possible, and denote by $i_n$ the
first of the $n$ parts, starting from the end-point $a$, which is not
entirely contained in an interval belonging to $Z$. The length of $i_n$
obviously tends to zero as $n$ tends to infinity. Let the middle point of
$i_n$ be denoted by $x_n$, and consider the sequence $x_1, x_2, \ldots$ Since this
is a bounded infinite sequence, it has by the Bolzano-Weierstrass
theorem (cf 2.2) certainly a limiting point $x$. Every neighbourhood of
the point $x$ then contains an interval $i_n$, which is not entirely
contained in any interval belonging to $Z$. On the other hand, $x$ is a
point of $(a, b)$ and is thus, by hypothesis, itself an inner point of some
interval belonging to $Z$. This evidently implies a contradiction, and
so the lemma is proved.
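
The subdivision idea of this proof can be tried out on a small example. The sketch below is only an illustration under simplifying assumptions: the family $Z$ is a finite list of open intervals (so the lemma itself is trivial here), the name `finite_subcover` and the bound `max_n` are introduced for this example, and the search simply tries $n = 1, 2, \ldots$ until each of the $n$ closed parts lies entirely inside some interval of the family.

```python
from fractions import Fraction

def finite_subcover(a, b, family, max_n=1024):
    """Find n such that each of the n equal closed parts of (a, b) lies
    entirely inside some open interval of `family`; return n and the
    intervals actually used."""
    for n in range(1, max_n + 1):
        step = (b - a) / n
        chosen = []
        for k in range(n):
            lo, hi = a + k * step, a + (k + 1) * step
            # The closed part [lo, hi] lies inside the open (u, v) iff u < lo and hi < v.
            cover = next((iv for iv in family if iv[0] < lo and hi < iv[1]), None)
            if cover is None:
                break                      # this n does not work, try n + 1
            if cover not in chosen:
                chosen.append(cover)
        else:
            return n, chosen
    raise ValueError("no suitable subdivision found up to max_n")

F = Fraction
family = [(F(-1, 4), F(1, 2)), (F(1, 3), F(3, 4)),
          (F(2, 3), F(5, 4)), (F(2, 5), F(3, 5))]
n, chosen = finite_subcover(F(0), F(1), family)
```

For this family the search stops at $n = 7$ and uses only the first three intervals; the fourth is never needed. The proof in the text shows that such an $n$ must exist for any admissible $Z$, whereas the code merely searches for it.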

It is evident that both the lemma and the above proof may be 
directly generalized to any number of dimensions. 

Let us now consider a sequence of intervals $i_\nu = (a_\nu, b_\nu)$ such that
the sum of all $i_\nu$ is a finite interval $i = (a, b)$, while no two of the $i_\nu$
have a common point:

$$i = \sum_{\nu=1}^{\infty} i_\nu \qquad (i_\mu i_\nu = 0 \ \text{for} \ \mu \ne \nu).$$

We want to prove that the corresponding relation holds for the lengths:

$$L(i) = \sum_{\nu=1}^{\infty} L(i_\nu). \tag{4.1.1}$$

In the first place, the $n$ intervals $i_1, \ldots, i_n$ are a finite number of
intervals contained in $i$, so that we have $\sum_{\nu=1}^{n} L(i_\nu) \le L(i)$ and hence,
allowing $n$ to tend to infinity,

$$\sum_{\nu=1}^{\infty} L(i_\nu) \le L(i).$$

It remains to prove the opposite inequality. This is the non-trivial
part of the proof.




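Consider the set $Z$ which consists of the following intervals: 1) the
intervals $i_\nu$, 2) the open intervals $(a - \varepsilon, a + \varepsilon)$ and $(b - \varepsilon, b + \varepsilon)$, 3) the
open intervals $\left(a_\nu - \frac{\varepsilon}{2^\nu},\ a_\nu + \frac{\varepsilon}{2^\nu}\right)$ and $\left(b_\nu - \frac{\varepsilon}{2^\nu},\ b_\nu + \frac{\varepsilon}{2^\nu}\right)$, where $\nu = 1,
2, \ldots$, while $\varepsilon$ is positive and arbitrarily small. It is then evident that
every point of the closed interval $(a, b)$ is an inner point of at least
one interval belonging to $Z$. According to Borel's lemma we may thus
entirely cover $i$ by means of a finite number of intervals belonging
to $Z$, and the sum of the lengths of these intervals will then certainly
be greater than $L(i) = b - a$. The sum of the lengths of all intervals
belonging to $Z$ will a fortiori be greater than $L(i)$, so that we have

$$\sum_{\nu=1}^{\infty} L(i_\nu) + 4\varepsilon + 4 \sum_{\nu=1}^{\infty} \frac{\varepsilon}{2^\nu} = \sum_{\nu=1}^{\infty} L(i_\nu) + 8\varepsilon > L(i).$$

Since $\varepsilon$ is arbitrary, it follows that

$$\sum_{\nu=1}^{\infty} L(i_\nu) \ge L(i),$$

and (4.1.1) is proved.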

It is further easily proved that (4.1.1) holds also in the case when
$i$ is an infinite interval. In this case, we have $L(i) = +\infty$, and if $i_0$
is any finite interval contained in $i$, it follows from the latter part of
the above proof that we have

$$\sum_{\nu=1}^{\infty} L(i_\nu) \ge L(i_0).$$

Since $i$ is infinite we may, however, choose $i_0$ such that $L(i_0)$ is greater
than any given quantity, and thus (4.1.1) holds in the sense that both
members are infinite.

We have thus proved that, if an interval is divided into a finite or
enumerable number of intervals without common points, the length of the
total interval is equal to the sum of the lengths of the parts. This
property will be expressed by saying that the length $L(i)$ is an additive
function of the interval $i$.
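
A simple instance of a division into an enumerable number of parts is the half-open interval $0 < x \le 1$, which is the sum of the parts $2^{-\nu} < x \le 2^{-(\nu-1)}$ for $\nu = 1, 2, \ldots$ The sketch below (the names `part` and `partial_sum` are assumptions made for this example) computes the partial sums of the lengths exactly and shows how they approach $L(i) = 1$.

```python
from fractions import Fraction

def part(v):
    """End-points of the v-th part (1/2**v, 1/2**(v-1)] of the interval (0, 1]."""
    return Fraction(1, 2 ** v), Fraction(1, 2 ** (v - 1))

def partial_sum(n):
    """Sum of the lengths of the first n parts."""
    return sum(hi - lo for lo, hi in (part(v) for v in range(1, n + 1)))

# Difference between L(i) = 1 and the n-th partial sum.
gaps = [1 - partial_sum(n) for n in (1, 2, 5, 10)]
```

The gap after $n$ parts is exactly $2^{-n}$, so the series of lengths converges to the length of the whole interval, in agreement with (4.1.1).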

4.2. Generalization. The length of an interval is a measure of
the extension of the interval. We have seen in the preceding
paragraph that this measure has the fundamental properties of being
non-negative and additive. The length of an interval $i$ is a non-negative
and additive interval function $L(i)$. The value of this function may be
finite or infinite.

We now ask if it is possible to define a measure with the same
fundamental properties also for more complicated sets than intervals.
With any set $S$ belonging to some more or less general class, we thus
want to associate a finite or infinite *) number $L(S)$, the measure of
$S$, in such a way that the following three conditions are satisfied:

a) $L(S) \ge 0$.

b) If $S = S_1 + S_2 + \cdots$, where $S_\mu S_\nu = 0$ for $\mu \ne \nu$, then we have
$L(S) = L(S_1) + L(S_2) + \cdots$

c) In the particular case when $S$ is an interval, $L(S)$ is equal to
the length of the interval.

Thus we want to extend the definition of the interval function $L(i)$,
so that we obtain a non-negative and additive set function $L(S)$ which,
in the particular case when $S$ is an interval $i$, coincides with $L(i)$.

It might well be asked why this extension should be restricted to
»some more or less general class of sets», and why we should not at
once try to define $L(S)$ for every set $S$. It can, however, be shown
that this is not possible. We shall accordingly content ourselves to
show that a set function $L(S)$ with the required properties can be
defined for a class of sets that includes the whole class of Borel sets.
This set function $L(S)$ is known as the Lebesgue measure of the set $S$.
We shall further show that the extension is unique or, more precisely,
that $L(S)$ is the only set function which is defined for all Borel sets and
satisfies the conditions a) to c).

4.3. The measure of a sum of intervals. We shall first define
a measure $L(I)$ for the sets $I$ belonging to the class $\mathfrak{I}$ considered in
2.3. Every set in $\mathfrak{I}$ is the sum of a finite or enumerable sequence of
intervals and, by the transformation used in 1.4, we can always take
these intervals such that no two of them have a common point. (In
fact, if the sets $S_\nu$ considered in 1.4 are intervals, every $Z_\nu$ will be
the sum of a finite number of intervals without common points.)

Any set in $\mathfrak{I}$ may thus be represented in the form

$$I = i_1 + i_2 + \cdots, \tag{4.3.1}$$

*) For the set function $L(S)$, and the more general set functions considered in Ch. 6,
we shall admit the existence of infinite values. For sets of points and for ordinary
functions, on the other hand, we shall only deal with infinity in the sense of a limit,
but not as an independent point or value (cf 2.1 and 3.1).



where the $i_\nu$ are intervals such that $i_\mu i_\nu = 0$ for $\mu \ne \nu$. By the
conditions b) and c) of 4.2, we must then define the measure $L(I)$ by
writing

$$L(I) = L(i_1) + L(i_2) + \cdots, \tag{4.3.2}$$

where as before $L(i_\nu)$ denotes the length of the interval $i_\nu$.

The representation of $I$ in the form (4.3.1) is, however, obviously
not unique. Let

$$I = j_1 + j_2 + \cdots \tag{4.3.3}$$

be another representation of the same set $I$, the $j_\nu$ being intervals
such that $j_\mu j_\nu = 0$ for $\mu \ne \nu$. We must then show that (4.3.1) and
(4.3.3) yield the same value of $L(I)$, i.e. that

$$\sum_{\mu=1}^{\infty} L(i_\mu) = \sum_{\nu=1}^{\infty} L(j_\nu). \tag{4.3.4}$$

This may be proved in the following way. For any interval $i_\mu$ we
have, since $i_\mu \subset I$,

$$i_\mu = i_\mu I = i_\mu \sum_{\nu} j_\nu = \sum_{\nu} i_\mu j_\nu,$$

and thus, by the additive property of the length of an interval,

$$L(i_\mu) = \sum_{\nu} L(i_\mu j_\nu),$$

$$\sum_{\mu} L(i_\mu) = \sum_{\mu} \sum_{\nu} L(i_\mu j_\nu). \tag{4.3.5}$$

In the same way we obtain

$$\sum_{\nu} L(j_\nu) = \sum_{\nu} \sum_{\mu} L(i_\mu j_\nu). \tag{4.3.6}$$

Now the following three cases may occur: 1) The intervals $i_\mu j_\nu$ are
all finite, and the double series with non-negative terms $\sum L(i_\mu j_\nu)$ is
convergent. 2) All the $i_\mu j_\nu$ are finite, and the double series is divergent.
3) At least one of the $i_\mu j_\nu$ is an infinite interval.

In case 1), the expressions in the second members of (4.3.5) and
(4.3.6) are finite and equal, and thus (4.3.4) holds. In cases 2) and 3)
the same expressions are both infinite. Thus in any case (4.3.4) is
proved, and it follows that the definition (4.3.2) yields a uniquely
determined (finite or infinite) value of $L(I)$.
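
The argument with the double series can be followed on a finite example. In the sketch below, two different representations of the same set by intervals without common points are given (the lists `rep_i` and `rep_j` and the helper names are assumptions made for this illustration), the lengths of all products $i_\mu j_\nu$ are formed, and the double sum is evaluated once by rows and once by columns.

```python
from fractions import Fraction as F

def length(iv):
    """Length of the interval (lo, hi); an empty product has length 0."""
    lo, hi = iv
    return max(hi - lo, 0)

def product(i, j):
    """Common part of two intervals, again written as a pair (lo, hi)."""
    return (max(i[0], j[0]), min(i[1], j[1]))

# Two representations of the same set (0, 2] + (3, 4] by half-open parts.
rep_i = [(F(0), F(1)), (F(1), F(2)), (F(3), F(4))]
rep_j = [(F(0), F(1, 2)), (F(1, 2), F(2)), (F(3), F(7, 2)), (F(7, 2), F(4))]

rows = [sum(length(product(i, j)) for j in rep_j) for i in rep_i]
cols = [sum(length(product(i, j)) for i in rep_i) for j in rep_j]
```

Each row sum reproduces $L(i_\mu)$ and each column sum reproduces $L(j_\nu)$, so both representations lead to the same total, here equal to 3.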

It is obvious that the measure $L(I)$ thus defined satisfies the
conditions a) and c) of 4.2. It remains to show that condition b) is also
satisfied.

Let $I_1, I_2, \ldots$ be a sequence of sets in $\mathfrak{I}$, such that $I_\mu I_\nu = 0$ for
$\mu \ne \nu$, and let

$$I_\mu = \sum_{\nu} i_{\mu\nu}$$

be a representation of $I_\mu$ in the form used above. Then

$$I = \sum_{\mu} I_\mu = \sum_{\mu} \sum_{\nu} i_{\mu\nu}$$

is also a set in $\mathfrak{I}$, and no two of the $i_{\mu\nu}$ have a common point. If
$i', i'', \ldots$ is an arrangement of the double series $\sum i_{\mu\nu}$ in a simple
sequence (e.g. by diagonals as in 1.4), we have

$$I = i' + i'' + \cdots,$$

$$L(I) = L(i') + L(i'') + \cdots$$

A discussion of possible cases similar to the one given above then
shows that we always have

$$L(I) = \sum_{\mu} \sum_{\nu} L(i_{\mu\nu}) = \sum_{\mu} L(I_\mu).$$

We have thus proved that (4.3.2) defines for all sets $I$ belonging to
the class $\mathfrak{I}$ a unique measure $L(I)$ satisfying the conditions a) to c) of 4.2.

We shall now deduce some properties of the measure $L(I)$. In
the first place, we consider a sequence $I_1, I_2, \ldots$ of sets in $\mathfrak{I}$, without
assuming that $I_\mu$ and $I_\nu$ have no common points. For the sum
$I = I_1 + I_2 + \cdots$, we obtain as above the representation $I = i' + i'' + \cdots$,
but the intervals $i', i'', \ldots$ may now have common points. By the
transformation used in 1.4 it is then easily seen that we always have

$$L(I) \le L(i') + L(i'') + \cdots,$$

which gives

$$L(I_1 + I_2 + \cdots) \le L(I_1) + L(I_2) + \cdots \tag{4.3.7}$$

(In the particular case when $I_\mu I_\nu = 0$ for $\mu \ne \nu$, we have already seen
that the sign of equality holds in this relation.)



We further observe that any enumerable set of points $x_1, x_2, \ldots$
is a set in $\mathfrak{I}$, since each $x_n$ may be regarded as a degenerate interval,
the length of which reduces to zero. It then follows from the
definition (4.3.2) that the measure of an enumerable set is always equal to
zero. Hence we obtain a simple proof of a property mentioned
above (1.1 and 2.1) without proof: the set of all points belonging to a
non-degenerate interval is a non-enumerable set. In fact, the measure of
this set is equal to the length of the interval, which is a positive
quantity, while any enumerable set is of measure zero. A fortiori,
the same property holds for a non-degenerate interval in $R_n$ with
$n > 1$ (cf 3.1).

Finally, we shall prove the following theorem that will be required
in the sequel for the extension of the definition of measure to more
general classes of sets: If $I$ and $J$ are sets in $\mathfrak{I}$ that are both of finite
measure, we have

$$L(I + J) = L(I) + L(J) - L(IJ). \tag{4.3.8}$$

Consider first the case when $I$ and $J$ both are sums of a finite
number of intervals. From the relations

$$I + J = I + (J - IJ),$$
$$J = IJ + (J - IJ),$$

we obtain, since all sets belong to $\mathfrak{I}$, and the two terms in each
second member have no common point,

$$L(I + J) = L(I) + L(J - IJ),$$
$$L(J) = L(IJ) + L(J - IJ),$$

and then by subtraction we obtain (4.3.8).

In the general case, when $I$ and $J$ are sums of finite or enumerable
sequences of intervals, we cannot argue in this simple way, as we are
not sure that $J - IJ$ is a set in $\mathfrak{I}$ (cf 2.3) and, if this is not the case,
the measure $L(J - IJ)$ has not yet been defined. Let

$$I = \sum_{\mu=1}^{\infty} i_\mu, \qquad J = \sum_{\nu=1}^{\infty} j_\nu,$$

be representations of $I$ and $J$ of the form (4.3.1), and put

$$I_n = \sum_{\mu=1}^{n} i_\mu, \qquad J_n = \sum_{\nu=1}^{n} j_\nu.$$

According to the above, we then have

$$L(I_n + J_n) = L(I_n) + L(J_n) - L(I_n J_n).$$

Allowing now $n$ to tend to infinity, each term of the last relation
tends to the corresponding term of (4.3.8), and thus this relation is
proved.
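
For sums of a finite number of intervals, the relation (4.3.8) can be verified numerically. The following sketch is an illustration only: sets are represented as Python lists of pairs $(lo, hi)$, the helper names `disjoint_parts`, `measure` and `product` are assumptions introduced here, and the concrete intervals are arbitrary. The sum of two sets is obtained by concatenating the lists, and the measure is computed after merging overlapping intervals into parts without common inner points.

```python
from fractions import Fraction as F

def disjoint_parts(intervals):
    """Merge a finite list of intervals (lo, hi) into non-overlapping parts."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged

def measure(intervals):
    """Measure of a finite sum of intervals, as in (4.3.2)."""
    return sum(hi - lo for lo, hi in disjoint_parts(intervals))

def product(I, J):
    """Non-degenerate common parts of the intervals of I and J."""
    return [(max(a, c), min(b, d)) for a, b in I for c, d in J
            if max(a, c) < min(b, d)]

I = [(F(0), F(2)), (F(5), F(6))]
J = [(F(1), F(3)), (F(4), F(7))]
lhs = measure(I + J)                                   # L(I + J)
rhs = measure(I) + measure(J) - measure(product(I, J)) # L(I) + L(J) - L(IJ)
```

Degenerate common parts are dropped in `product`, which does not affect the measure since their length is zero.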

4.4. Outer and inner measure of a bounded set. In the
preceding paragraph, we have defined a measure $L(I)$ for all sets $I$
belonging to the class $\mathfrak{I}$. In order to extend the definition to a more
general class of sets, we shall now introduce two auxiliary functions,
the inner and outer measure, that will be defined for every bounded
set in $R_1$.

Throughout this paragraph, we shall only consider bounded sets.
We choose a fixed finite interval $(a, b)$ as our space and consider only
points and sets belonging to $(a, b)$. When speaking about the
complement $S^*$ of a set $S$, we shall accordingly always mean the
complement with respect to $(a, b)$. (Cf 1.3.)

In order to define the new functions, we consider a set $I$ belonging
to the class $\mathfrak{I}$, such that $S \subset I \subset (a, b)$. Thus we enclose the set $S$ in
a sum $I$ of intervals, which in its turn is a subset of $(a, b)$. This can
always be done, since we may e.g. choose $I = (a, b)$. The enclosing
set $I$ has a measure $L(I)$ defined in the preceding paragraph. Consider
the set formed by the numbers $L(I)$ corresponding to all possible
enclosing sets $I$. Obviously this set has a finite lower bound, since we
have $L(I) \ge 0$.

The outer measure $\bar{L}(S)$ of the set $S$ will be defined as the lower
bound of the set of all these numbers $L(I)$. The inner measure $\underline{L}(S)$ of
$S$ will be defined by the relation $\underline{L}(S) = b - a - \bar{L}(S^*)$.

Since every set $S$ considered here is a subset of the interval $(a, b)$,
which is itself a set in $\mathfrak{I}$, we obviously have

$$0 \le \bar{L}(S) \le b - a, \qquad 0 \le \underline{L}(S) \le b - a.$$

Directly from the definitions we further find that $\bar{L}(S)$ and $\underline{L}(S)$ are
both monotone functions of $S$, i.e. that we have

$$\bar{L}(S_1) \le \bar{L}(S_2), \qquad \underline{L}(S_1) \le \underline{L}(S_2), \tag{4.4.1}$$

as soon as $S_1 \subset S_2$. In fact, for any $I$ such that $S_2 \subset I$, we then also
have $S_1 \subset I$, and hence the first inequality follows immediately. The
second inequality is obtained from the first by considering the
complementary sets.

Further, if $S \subset I_1$ and $S^* \subset I_2$, every point of $(a, b)$ belongs to at
least one of the sets $I_1$ and $I_2$. Since $I_1$ and $I_2$ are both contained
in $(a, b)$, we then have $I_1 + I_2 = (a, b)$ and thus by (4.3.7)

    $L(I_1) + L(I_2) \ge b - a.$

Choosing the enclosing sets $I_1$ and $I_2$ in all possible ways, we find
that the corresponding inequality must hold for the lower bounds of
$L(I_1)$ and $L(I_2)$, so that we may write

    $\bar{L}(S) + \bar{L}(S^*) \ge b - a$

or

(4.4.2)    $\underline{L}(S) \le \bar{L}(S).$

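Let $S_1, S_2, \ldots$ be a given sequence of sets with or without common
points. According to the definition of outer measure, we can for every
$n$ find $I_n$ such that $S_n \subset I_n$ and

    $L(I_n) < \bar{L}(S_n) + \dfrac{\varepsilon}{2^n},$

where $\varepsilon$ is arbitrarily small. We then have $S_1 + S_2 + \cdots \subset I_1 + I_2 + \cdots$,
and from (4.3.7) we obtain

    $\bar{L}(S_1 + S_2 + \cdots) \le L(I_1 + I_2 + \cdots) \le L(I_1) + L(I_2) + \cdots$
    $\qquad < \bar{L}(S_1) + \bar{L}(S_2) + \cdots + \varepsilon\,(\tfrac12 + \tfrac14 + \cdots).$

Since $\varepsilon$ is arbitrary, it follows that

(4.4.3)    $\bar{L}(S_1 + S_2 + \cdots) \le \bar{L}(S_1) + \bar{L}(S_2) + \cdots.$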
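
The bound $\varepsilon(\tfrac12 + \tfrac14 + \cdots) \le \varepsilon$ used in this argument can be checked with a few lines of Python. This is only an illustrative sketch: the name `cover_length` is introduced here and is not part of the text, and the enclosing intervals are simply assumed to have length $\varepsilon/2^n$.

```python
from fractions import Fraction

def cover_length(num_sets, eps):
    """Total excess length eps/2 + eps/4 + ... + eps/2**num_sets."""
    return sum(Fraction(eps) / 2**n for n in range(1, num_sets + 1))

eps = Fraction(1, 100)
totals = [cover_length(k, eps) for k in (1, 10, 50)]
# Each total equals eps * (1 - 2**-k), hence stays strictly below eps.
```

However many sets are enclosed, the accumulated excess never reaches $\varepsilon$, which is what allows $\varepsilon$ to be dropped in the limit.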

In order to deduce a corresponding inequality for the inner measure,
we consider two sets $S_1$ and $S_2$ without common points. Let the
complementary sets $S_1^*$ and $S_2^*$ be enclosed in $I_1$ and $I_2$ respectively.
Abbreviating the words "lower bound of" by "l. b.", we then have

(4.4.4)    $b - a - \underline{L}(S_1) = \bar{L}(S_1^*) = \text{l. b. } L(I_1),$
           $b - a - \underline{L}(S_2) = \bar{L}(S_2^*) = \text{l. b. } L(I_2),$

where the enclosing sets $I_1$ and $I_2$ have to be chosen in all possible
ways. Further, we have by (1.3.1)

    $(S_1 + S_2)^* = S_1^* S_2^* \subset I_1 I_2,$

but here we can only infer that

(4.4.5)    $b - a - \underline{L}(S_1 + S_2) = \bar{L}[(S_1 + S_2)^*] \le \text{l. b. } L(I_1 I_2),$

since there may be other enclosing $I$-sets for $(S_1 + S_2)^*$ besides those
of the form $I_1 I_2$. From (4.4.4) and (4.4.5) we deduce, using (4.3.8),

    $\underline{L}(S_1 + S_2) - \underline{L}(S_1) - \underline{L}(S_2) \ge \text{l. b. } [L(I_1) + L(I_2)] - \text{l. b. } L(I_1 I_2) - (b - a)$
    $\qquad \ge \text{l. b. } [L(I_1) + L(I_2) - L(I_1 I_2)] - (b - a)$
    $\qquad = \text{l. b. } L(I_1 + I_2) - (b - a).$

Since $S_1$ and $S_2$ have no common point we have, however, $S_1 S_2 = 0$
and $I_1 + I_2 \supset S_1^* + S_2^* = (S_1 S_2)^* = (a, b)$. On the other hand, $I_1$ and
$I_2$ are both contained in $(a, b)$, so that $I_1 + I_2 \subset (a, b)$. Thus $I_1 + I_2 =
(a, b)$, and

    $\underline{L}(S_1 + S_2) \ge \underline{L}(S_1) + \underline{L}(S_2).$

Let now $S_1, S_2, \ldots$ be a sequence of sets, no two of which have a
common point. By a repeated use of the last inequality, we then obtain

(4.4.6)    $\underline{L}(S_1 + S_2 + \cdots) \ge \underline{L}(S_1) + \underline{L}(S_2) + \cdots.$

In the particular case when $S$ is an interval, it is easily seen from
the definitions that $\bar{L}(S)$ and $\underline{L}(S)$ are both equal to the length of
the interval. If $I = \sum i_\nu$ is a set in $\mathfrak{I}$, where the $i_\nu$ are intervals
without common points, we then obtain from (4.4.3) and (4.4.6)

    $\bar{L}(I) \le \sum L(i_\nu), \qquad \underline{L}(I) \ge \sum L(i_\nu),$

and thus by (4.4.2) and (4.3.2)

(4.4.7)    $\bar{L}(I) = \underline{L}(I) = L(I).$
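
For a finite sum of intervals, relation (4.4.7) can be verified by direct computation. The Python sketch below is an illustration under stated assumptions, not part of the original argument: the helper names `merge`, `length` and `complement` are invented here, intervals are given as pairs of end points lying in $(a, b)$, and end points themselves are ignored since single points do not affect the lengths. The outer measure of $I$ is taken as $L(I)$, and the inner measure as $b - a$ minus the length of the complement.

```python
from fractions import Fraction

def merge(intervals):
    """Merge overlapping or touching intervals (lo, hi) into a disjoint sorted list."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged

def length(intervals):
    """L(I): total length of a finite sum of intervals."""
    return sum(hi - lo for lo, hi in merge(intervals))

def complement(intervals, a, b):
    """Complement of a finite sum of intervals with respect to (a, b)."""
    gaps, cursor = [], a
    for lo, hi in merge(intervals):
        if lo > cursor:
            gaps.append((cursor, lo))
        cursor = max(cursor, hi)
    if cursor < b:
        gaps.append((cursor, b))
    return gaps

a, b = Fraction(0), Fraction(10)
I = [(Fraction(1), Fraction(3)), (Fraction(2), Fraction(5)), (Fraction(7), Fraction(8))]
outer = length(I)                              # I encloses itself
inner = (b - a) - length(complement(I, a, b))  # b - a - L(I*)
```

Here the overlapping intervals $(1, 3)$ and $(2, 5)$ are counted once, and both `outer` and `inner` come out as 5.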

Finally, we observe that the outer and inner measures are
independent of the interval $(a, b)$ in which we have assumed all our sets to
be contained. By 2.2, a bounded set $S$ is always contained in the
closed interval $(\alpha, \beta)$, where $\alpha$ and $\beta$ are the lower and upper bounds
of $S$. If $(a, b)$ is any other interval containing $S$, we must have $a \le \alpha$
and $b \ge \beta$. A simple consideration will then show that the two
intervals $(a, b)$ and $(\alpha, \beta)$ will yield the same values of the outer and inner
measures of $S$. Thus the quantities $\bar{L}(S)$ and $\underline{L}(S)$ depend only on
the set $S$ itself, and not on the interval $(a, b)$.

4.5. Measurable sets and Lebesgue measure. A bounded set $S$
will be called measurable, if its outer and inner measures are equal.
Their common value will then be denoted by $L(S)$ and called the
Lebesgue measure or simply the measure of $S$:

    $L(S) = \bar{L}(S) = \underline{L}(S).$

An unbounded set $S$ will be called measurable if the product $i_x S$,
where $i_x$ denotes the closed interval $(-x, x)$, is measurable for every
$x > 0$. The measure $L(S)$ will then be defined by the relation

    $L(S) = \lim_{x \to \infty} L(i_x S).$

By (4.4.1), $L(i_x S)$ is a never decreasing function of $x$. Thus the limit,
which may be finite or infinite, always exists.

In the particular case when $S$ is a set $I$ in $\mathfrak{I}$, the new definition of
measure is consistent with the previous definition (4.3.2). For a bounded
set $I$, this follows immediately from (4.4.7). For an unbounded set $I$,
we obtain the same result by considering the bounded set $i_x I$ and
allowing $x$ to tend to infinity.

According to (4.4.1), $\bar{L}(S)$ and $\underline{L}(S)$ are both monotone functions
of the set $S$. It then follows from the above definition that the same
holds for $L(S)$. For any two measurable sets $S_1$ and $S_2$ such that
$S_1 \subset S_2$ we thus have

(4.5.1)    $L(S_1) \le L(S_2).$

We shall now show that the measure $L(S)$ satisfies the conditions
a)-c) of 4.2. With respect to the conditions a) and c), this follows
directly from the above, so that it only remains to prove that the
condition b) is satisfied. This is the content of the following theorem.

If $S_1, S_2, \ldots$ are measurable sets, no two of which have a common
point, then the sum $S_1 + S_2 + \cdots$ is also measurable, and we have

(4.5.2)    $L(S_1 + S_2 + \cdots) = L(S_1) + L(S_2) + \cdots.$

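Consider first the case when $S_1, S_2, \ldots$ are all contained in a finite
interval $(a, b)$. The relations (4.4.3) and (4.4.6) then give, since all the
$S_n$ are measurable,

    $\bar{L}(S_1 + S_2 + \cdots) \le \bar{L}(S_1) + \bar{L}(S_2) + \cdots = L(S_1) + L(S_2) + \cdots,$
    $\underline{L}(S_1 + S_2 + \cdots) \ge \underline{L}(S_1) + \underline{L}(S_2) + \cdots = L(S_1) + L(S_2) + \cdots.$

By (4.4.2) we have, however, $\underline{L}(S_1 + S_2 + \cdots) \le \bar{L}(S_1 + S_2 + \cdots)$, and thus

    $\bar{L}(S_1 + S_2 + \cdots) = \underline{L}(S_1 + S_2 + \cdots) = L(S_1) + L(S_2) + \cdots,$

so that in this case our assertion is true.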

In the general case, we consider the products $i_x S_1, i_x S_2, \ldots$, all of
which are contained in the finite interval $i_x$. The above argument then
shows that the product $i_x (S_1 + S_2 + \cdots)$ is measurable for any $x$, and that

    $L[i_x (S_1 + S_2 + \cdots)] = L(i_x S_1) + L(i_x S_2) + \cdots.$

Then, by definition, $S_1 + S_2 + \cdots$ is measurable and we have, since
every term of the last series is a never decreasing function of $x$,

    $L(S_1 + S_2 + \cdots) = \lim_{x \to \infty} [L(i_x S_1) + L(i_x S_2) + \cdots]$
    $\qquad = L(S_1) + L(S_2) + \cdots.$

Thus (4.5.2) is proved, and the Lebesgue measure $L(S)$ satisfies all
three conditions of 4.2.

A set $S$ such that $L(S) = 0$ is called a set of measure zero. If the
outer measure $\bar{L}(S) = 0$, it follows from the definition of measure that
$S$ is of measure zero. We have seen in 4.3 that, in particular, any
enumerable set has this property. The following two propositions
are easily found from the above. Any subset of a set of measure zero
is itself of measure zero. The sum of a sequence of sets of measure zero
is itself of measure zero. These propositions are in fact direct
consequences of the relations (4.4.1) and (4.4.3) for the outer measure.

4.6. The class of measurable sets. Let us consider the class
$\mathfrak{L}$ of all measurable sets in $\mathbf{R}_1$. We are going to show that $\mathfrak{L}$ is an
additive class of sets (cf 1.6). Since we have seen in the preceding
paragraph that $\mathfrak{L}$ contains all intervals, it then follows from 2.3 that
$\mathfrak{L}$ contains the whole class $\mathfrak{B}_1$ of all Borel sets, so that all Borel sets
are measurable.

We shall, in fact, prove that the class $\mathfrak{L}$ satisfies the conditions
$a_1)$, $b_1)$ and $c_1)$ of 1.6. With respect to $a_1)$, this is obvious, so that we
need only consider $b_1)$ and $c_1)$.

Let us first take $c_1)$. It is required to show that the complement
of a measurable set $S$ is itself measurable. Consider first the case of a
bounded set $S$ and its complement $S^*$ with respect to some finite
interval $(a, b)$ containing $S$. By the definition of inner measure (4.4)
we then have, since $S$ is measurable,

    $\underline{L}(S^*) = b - a - \bar{L}(S) = b - a - \underline{L}(S) = \bar{L}(S^*),$

so that $S^*$ is measurable, and has the measure $b - a - L(S)$. In the
general case when $S$ is measurable but not necessarily bounded, the
same argument shows that the product $i_x S^*$, where $S^*$ is now the
complement with respect to the whole space $\mathbf{R}_1$, is measurable for
any $x > 0$. Then, by definition, $S^*$ is measurable.

Consider now the condition $b_1)$. We have to show that the sum
$S_1 + S_2 + \cdots$ of any measurable sets $S_1, S_2, \ldots$ is itself measurable.
In the particular case when $S_\mu S_\nu = 0$ for $\mu \ne \nu$, this has already been
proved in connection with (4.5.2), but it still remains to prove the
general case.

It is sufficient to consider the case when all $S_n$ are contained in
a finite interval $(a, b)$. In fact, if our assertion has been proved for
this case, we consider the sets $i_x S_1, i_x S_2, \ldots$, and find that their sum
$i_x (S_1 + S_2 + \cdots)$ is measurable for any $x > 0$. Then, by definition,
$S_1 + S_2 + \cdots$ is measurable.

We thus have to prove that, if the measurable sets $S_1, S_2, \ldots$ are
all contained in $(a, b)$, the sum $S_1 + S_2 + \cdots$ is measurable.

We shall first prove this for the particular case of only two sets
$S_1$ and $S_2$. Let $n$ denote any of the indices 1 and 2, and let the
complementary sets be taken with respect to $(a, b)$. Since $S_n$ and $S_n^*$
are both measurable, we can find two sets $I_n$ and $J_n$ in $\mathfrak{I}$ such that

(4.6.1)    $S_n \subset I_n \subset (a, b), \qquad S_n^* \subset J_n \subset (a, b),$

while the differences $L(I_n) - L(S_n)$ and $L(J_n) - L(S_n^*)$ are both smaller
than any given $\varepsilon > 0$. Now by (4.6.1) any point of $(a, b)$ must belong
to at least one of the sets $I_n$ and $J_n$, so that we have $I_n + J_n = (a, b)$,
and thus by (4.3.8)

(4.6.2)    $L(I_n J_n) = L(I_n) + L(J_n) - (b - a)$
           $\qquad = L(I_n) + L(J_n) - L(S_n) - L(S_n^*) < 2\varepsilon.$

It further follows from (4.6.1) that

    $S_1 + S_2 \subset I_1 + I_2,$
    $(S_1 + S_2)^* = S_1^* S_2^* \subset J_1 J_2,$

and hence

(4.6.3)    $\bar{L}(S_1 + S_2) \le L(I_1 + I_2),$
           $\underline{L}(S_1 + S_2) \ge b - a - L(J_1 J_2).$

By the same argument as before, we find that $I_1 + I_2 + J_1 J_2 = (a, b)$.
The relations (4.6.3) then give, using once more (4.3.8),

    $\bar{L}(S_1 + S_2) - \underline{L}(S_1 + S_2) \le L(I_1 + I_2) + L(J_1 J_2) - (b - a) = L[(I_1 + I_2) J_1 J_2].$

Now

    $(I_1 + I_2) J_1 J_2 \subset I_1 J_1 + I_2 J_2,$

so that we obtain by means of (4.5.1), (4.3.7) and (4.6.2)

    $\bar{L}(S_1 + S_2) - \underline{L}(S_1 + S_2) \le L(I_1 J_1) + L(I_2 J_2) < 4\varepsilon.$

Since $\varepsilon$ is arbitrary, and since the outer measure is always at least
equal to the inner measure, it then follows that $\bar{L}(S_1 + S_2) = \underline{L}(S_1 + S_2)$,
so that $S_1 + S_2$ is measurable.

It immediately follows that any sum $S_1 + \cdots + S_n$ of a finite number
of measurable sets, all contained in $(a, b)$, is measurable. The relation
$S_1 S_2 \cdots S_n = (S_1^* + \cdots + S_n^*)^*$ then shows that the same property holds
for a product.

Consider finally the case of an infinite sum. By the transformation
used in 1.4, we have $S = S_1 + S_2 + \cdots = Z_1 + Z_2 + \cdots$, where
$Z_\nu = S_1^* \cdots S_{\nu-1}^* S_\nu$, and $Z_\mu Z_\nu = 0$ for $\mu \ne \nu$. Since $S_1^*, \ldots, S_{\nu-1}^*$ and
$S_\nu$ are all measurable, the finite product $Z_\nu$ is measurable. Finally, by
(4.5.2), the sum $Z_1 + Z_2 + \cdots$ is measurable.
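
The transformation of a sum of overlapping sets into a sum of sets without common points can be tried out on small finite sets. The following Python sketch is only an illustration: the function name `disjointify` is introduced here, and finite sets of integers stand in for the point sets of the text.

```python
def disjointify(sets):
    """Return Z_1, Z_2, ... where Z_v holds the points of S_v
    that belong to none of S_1, ..., S_{v-1}."""
    seen, parts = set(), []
    for s in sets:
        parts.append(set(s) - seen)  # S_v minus all earlier sets
        seen |= set(s)
    return parts

S = [{1, 2, 3}, {3, 4}, {1, 4, 5}]
Z = disjointify(S)   # [{1, 2, 3}, {4}, {5}]
```

The parts `Z` have the same sum as the original sets, and no two of them have a common point, which is exactly the situation covered by (4.5.2).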

We have thus completed the proof that the measurable sets form an
additive class $\mathfrak{L}$. It follows that any sum, product or difference of a
finite or enumerable number of measurable sets is itself measurable. In
particular, all Borel sets are measurable.


4.7. Measurable sets and Borel sets. The class $\mathfrak{L}$ of measurable
sets is, in fact, more general than the class $\mathfrak{B}_1$ of Borel sets. As an
illustration of the difference in generality between the two classes,
we mention without proof the following proposition. Any measurable
set is the sum of a Borel set and a set of measure zero. All sets
occurring in ordinary applications of mathematical analysis are,
however, Borel sets, and we shall accordingly in general restrict ourselves
to the consideration of the class $\mathfrak{B}_1$, and the corresponding class $\mathfrak{B}_n$
in spaces of $n$ dimensions.


We shall now prove the statement made in 4.2 that the Lebesgue
measure is the only set function defined for all Borel sets and satisfying
the conditions a)-c) of 4.2.

Let, in fact, $\Lambda(S)$ be any set function satisfying all the conditions
just stated. For any set $I$ in $\mathfrak{I}$, we must obviously have $\Lambda(I) = L(I)$,
since our definition (4.3.2) of $L(I)$ was directly imposed by the
conditions b) and c) of 4.2. Let now $S$ be a bounded Borel set, and
enclose $S$ in a sum $I$ of intervals. From the conditions a) and b) it then
follows that we have $\Lambda(S) \le \Lambda(I) = L(I)$. The lower bound of $L(I)$
for all enclosing $I$ is equal to $L(S)$, and so we have $\Lambda(S) \le L(S)$.
Replacing $S$ by its complement $S^*$ with respect to some finite interval,
we have $\Lambda(S^*) \le L(S^*)$, and hence $\Lambda(S) \ge L(S)$. Thus $\Lambda(S)$ and
$L(S)$ are identical for all bounded Borel sets. This identity holds even
for unbounded sets, since any unbounded Borel set may obviously be
represented as the sum of a sequence of bounded Borel sets.

We shall finally prove a theorem concerning the measure of the
limit (cf 1.5) of a monotone sequence of Borel sets. By 2.3, we know
that any such limit is always a Borel set.

For a non-decreasing sequence $S_1, S_2, \ldots$ of Borel sets we have

(4.7.1)    $\lim L(S_n) = L(\lim S_n).$

For a non-increasing sequence, the same relation holds provided that $L(S_1)$
is finite.

For a non-decreasing sequence we may in fact write

    $\lim S_n = S_1 + (S_2 - S_1) + (S_3 - S_2) + \cdots,$

and then obtain by (4.5.2)

    $L(\lim S_n) = L(S_1) + L(S_2 - S_1) + \cdots$
    $\qquad = \lim\,[L(S_1) + L(S_2 - S_1) + \cdots + L(S_n - S_{n-1})]$
    $\qquad = \lim L(S_n).$

For a non-increasing sequence such that $L(S_1)$ is finite, the same
relation is proved by considering the complementary sets $S_n^*$ with respect
to $S_1$. The example $S_n = (n, +\infty)$ shows that the condition that
$L(S_1)$ should be finite cannot be omitted.
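
A few worked instances may help to fix the content of (4.7.1). The sequences of intervals below are chosen here for illustration and do not occur in the text, apart from the last one.

```latex
% Non-decreasing case: S_n = (0, 1 - 1/n), so that \lim S_n = (0, 1).
\lim_{n\to\infty} L(S_n) = \lim_{n\to\infty}\Bigl(1 - \frac{1}{n}\Bigr) = 1 = L\bigl((0,1)\bigr).

% Non-increasing case with L(S_1) finite: S_n = (0, 1/n), and \lim S_n is empty.
\lim_{n\to\infty} L(S_n) = \lim_{n\to\infty}\frac{1}{n} = 0 = L(\lim S_n).

% Non-increasing case with L(S_1) infinite: S_n = (n, +\infty), and \lim S_n is empty.
L(S_n) = +\infty \ \text{for every } n, \qquad L(\lim S_n) = 0.
```

In the last case the two sides of (4.7.1) differ, in agreement with the remark that the finiteness of $L(S_1)$ cannot be omitted.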

CHAPTER 5.

The Lebesgue Integral for Functions of One Variable.

5.1. The integral of a bounded function over a set of finite measure.
All point sets considered in the rest of this book are Borel sets,
unless expressly stated otherwise.¹) Generally this will not be explicitly
mentioned, and should then always be tacitly understood.

¹) In order to give a full account of the theory of the Lebesgue integral, it would
be necessary to consider measurable sets, and not only Borel sets. As stated in 4.7
the restriction to Borel sets is, however, amply sufficient for our purposes.

Let $S$ be a given set of finite measure $L(S)$, and $g(x)$ a function of
the real variable $x$ defined for all values of $x$ belonging to $S$. We
shall suppose that $g(x)$ is bounded in $S$, i.e. that the lower and upper
bounds of $g(x)$ in $S$ are finite. We denote these bounds by $m$ and $M$
respectively, and thus have $m \le g(x) \le M$ for all $x$ belonging to $S$.
Let us divide $S$ into a finite number of parts $S_1, S_2, \ldots, S_n$, no two
of which have a common point, so that we have

    $S = S_1 + S_2 + \cdots + S_n, \qquad (S_\mu S_\nu = 0 \text{ for } \mu \ne \nu).$

In the set $S_\nu$, the function $g(x)$ has a lower bound $m_\nu$ and an upper
bound $M_\nu$, such that $m \le m_\nu \le M_\nu \le M$.

We now define the lower and upper Darboux sums associated with
this division of $S$ by the relations

(5.1.1)    $z = \sum_{\nu=1}^{n} m_\nu L(S_\nu), \qquad Z = \sum_{\nu=1}^{n} M_\nu L(S_\nu).$

It is then obvious that we have

    $m L(S) \le z \le Z \le M L(S).$

It is also directly seen that any division of $S$ superposed on the above
division, i.e. any division obtained by subdivision of some of the parts
$S_\nu$, will give a lower sum at least equal to the lower sum of the
original division, and an upper sum at most equal to the upper sum
of the original division.
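
The behaviour of the Darboux sums under subdivision can be observed in a small computation. The Python sketch below is an illustration under assumptions made here: the function name `darboux_sums` is invented, the set $S$ is the interval $(0, 1)$ divided into sub-intervals, and $g(x) = x^2$ is increasing, so that its lower and upper bounds on a sub-interval are its values at the left and right end points.

```python
from fractions import Fraction

def darboux_sums(g, cuts):
    """Lower and upper Darboux sums of an increasing g over the
    sub-intervals determined by the sorted cut points."""
    z = sum(g(lo) * (hi - lo) for lo, hi in zip(cuts, cuts[1:]))  # m_v = g(left end)
    Z = sum(g(hi) * (hi - lo) for lo, hi in zip(cuts, cuts[1:]))  # M_v = g(right end)
    return z, Z

g = lambda x: x * x
coarse = [Fraction(k, 2) for k in range(3)]  # 0, 1/2, 1
fine = [Fraction(k, 4) for k in range(5)]    # 0, 1/4, 1/2, 3/4, 1
z1, Z1 = darboux_sums(g, coarse)             # 1/8 and 5/8
z2, Z2 = darboux_sums(g, fine)               # 7/32 and 15/32
```

The finer division is superposed on the coarser one, and accordingly $z_1 \le z_2 \le Z_2 \le Z_1$, all four sums lying between $mL(S) = 0$ and $ML(S) = 1$.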

Any division of $S$ in an arbitrary finite number of parts without
common points yields, according to (5.1.1), a lower sum $z$ and an
upper sum $Z$. Consider the set of all possible lower sums $z$, and the
set of all possible upper sums $Z$. We shall call these briefly the $z$-set
and the $Z$-set. Both sets are bounded, since all $z$ and $Z$ are situated
between the points $mL(S)$ and $ML(S)$. We shall now show that the
upper bound of the $z$-set is at most equal to the lower bound of the $Z$-set.
Thus the two sets have at most one common point, and apart from
this point, the entire $z$-set is situated to the left of the entire $Z$-set.

In order to prove this statement, let $z'$ be an arbitrary lower sum,
corresponding to the division $S = S_1' + \cdots + S_{n'}'$, while $Z''$ is an
arbitrary upper sum, corresponding to the division $S = S_1'' + \cdots + S_{n''}''$.
It is then clearly sufficient to prove that we have $z' \le Z''$. This
follows, however, immediately if we consider the division

    $S = \sum_{i=1}^{n'} \sum_{k=1}^{n''} S_i' S_k'',$

which is superposed on both the previous divisions. If the
corresponding Darboux sums are $z_0$ and $Z_0$, we have by the above remark
$z' \le z_0 \le Z_0 \le Z''$, and thus our assertion is proved.

The upper bound of the $z$-set will be called the lower integral of
$g(x)$ over $S$, while the lower bound of the $Z$-set will be called the
upper integral of $g(x)$ over $S$. We write

(5.1.2)    $\underline{\int}_S\, g(x)\,dx = \text{upper bound of the } z\text{-set},$
           $\overline{\int}_S\, g(x)\,dx = \text{lower bound of the } Z\text{-set}.$

It then follows from the above that we have

(5.1.3)    $m L(S) \le \underline{\int}_S\, g(x)\,dx \le \overline{\int}_S\, g(x)\,dx \le M L(S).$

If the lower and upper integrals are equal (i.e. if the upper bound
of the $z$-set is equal to the lower bound of the $Z$-set), $g(x)$ is said to
be integrable in the Lebesgue sense over $S$, or briefly integrable over $S$.
The common value of the two integrals is then called the Lebesgue
integral of $g(x)$ over $S$, and we write

    $\int_S g(x)\,dx = \underline{\int}_S\, g(x)\,dx = \overline{\int}_S\, g(x)\,dx.$

A necessary and sufficient condition for the integrability of $g(x)$
over $S$ is that, to every $\varepsilon > 0$, we can find a division of $S$ such that
the corresponding difference $Z - z$ is smaller than $\varepsilon$. In fact, if this
condition is satisfied, it follows from our definitions of the lower and
upper integrals that the difference between these is smaller than $\varepsilon$,
and since $\varepsilon$ is arbitrary, the two integrals must be equal. Conversely,
if it is known that $g(x)$ is integrable, it immediately follows that
there must be one lower sum $z'$ and one upper sum $Z''$, such that
$Z'' - z' < \varepsilon$. The division superposed on both the corresponding
divisions in the manner considered above will then give a lower sum $z_0$
and an upper sum $Z_0$ such that $Z_0 - z_0 < \varepsilon$.

It will be seen that all this is perfectly analogous to the ordinary
textbook definition of the Riemann integral. In that case, the set $S$ is an
interval which is divided into a finite number of sub-intervals $S_\nu$, and
the Darboux sums $z$ and $Z$ are then formed according to (5.1.1), where
now $L(S_\nu)$ denotes the length of the $\nu$:th sub-interval $S_\nu$. The only
difference is that, in the present case, we consider a more general
class of sets than intervals, since $S$ and the parts $S_\nu$ may be any
Borel sets. At the same time, we have replaced the length of the
interval $S_\nu$ by its natural generalization, the measure of the set $S_\nu$.

In the particular case when $S$ is a finite interval $(a, b)$, any division
of $(a, b)$ in sub-intervals considered in the course of the definition of
the Riemann integral is a special case of the divisions in Borel sets
occurring in the definition of the Lebesgue integral. In the latter case,
however, we consider also divisions of the interval $(a, b)$ in parts which
are Borel sets other than intervals. These more general divisions may
possibly increase the value of the upper bound of the $z$-set, and
reduce the value of the lower bound of the $Z$-set. Thus we see that
the lower and upper integrals defined by (5.1.2) are situated between
the corresponding Riemann integrals. If $g(x)$ is integrable in the
Riemann sense, the latter are equal, and thus a fortiori the two integrals
(5.1.2) are equal, so that $g(x)$ is also integrable in the Lebesgue sense,
with the same value of the integral. When we are concerned with
functions integrable in the Riemann sense, and with integrals over an interval,
it is thus not necessary to distinguish between the two kinds of integrals.

The definition of the Lebesgue integral is, of course, somewhat more complicated
than the definition of the Riemann integral. The introduction of this complication is
justified by the fact that the properties of the Lebesgue integral are simpler than
those of the Riemann integral. In order to show by an example that the Lebesgue
integral exists for a more general class of functions than the Riemann integral, we
consider a function $g(x)$ equal to 0 when $x$ is irrational, and to 1 when $x$ is rational.
In every non-degenerate interval this function has the lower bound 0 and the upper
bound 1. The lower and upper Darboux sums occurring in the definition of the
Riemann integral of $g(x)$ over the interval $(0, 1)$ are thus, for any division in
sub-intervals, equal to 0 and 1 respectively, so that the Riemann integral does not exist.
If, on the other hand, we divide the interval $(0, 1)$ into the two parts $S_i$ and $S_r$,
containing respectively the irrational and the rational numbers of the interval, $g(x)$
is equal to 0 everywhere in $S_i$, and to 1 everywhere in $S_r$. Further, $S_i$ has the
measure 1, and $S_r$ the measure 0, so that both Darboux sums (5.1.1) corresponding to
this division are equal to 0. Then the lower and upper integrals (5.1.2) are both
equal to 0, and thus the Lebesgue integral of $g(x)$ over $(0, 1)$ exists and has the value 0.

The Lebesgue integral over an interval $(a, b)$ is usually written in
the same notation as a Riemann integral:

    $\int_a^b g(x)\,dx.$

We shall see below (cf 5.3) that this integral has the same value
whether we consider $(a, b)$ as closed, open, or half-open. In the
particular case when $g(x)$ is continuous for $a \le x \le b$, the integral

    $G(x) = \int_a^x g(t)\,dt$

exists as a Riemann integral, and thus a fortiori as a Lebesgue
integral, and we have

(5.1.4)    $G'(x) = g(x)$

for all $x$ in $(a, b)$.


5.2. B-measurable functions. A function $g(x)$ defined for all $x$
in a set $S$ is said to be measurable in the Borel sense or B-measurable
in the set $S$ if the subset of all points $x$ in $S$ such that $g(x) \le k$ is
a Borel set for every real value of $k$. We shall prove the following
important theorem:

If $g(x)$ is bounded and B-measurable in a set $S$ of finite measure,
then $g(x)$ is integrable over $S$.

Suppose that we have $m < g(x) \le M$ for all $x$ belonging to $S$. Let
$\varepsilon > 0$ be given, and divide the interval $(m, M)$ in sub-intervals by
means of points $y_\nu$ such that

    $m = y_0 < y_1 < \cdots < y_{n-1} < y_n = M,$

the length of each sub-interval being $< \varepsilon$. Obviously this can always
be done by taking $n$ sufficiently large. Now let $S_\nu$ denote the set of
all points $x$ belonging to $S$ such that

    $y_{\nu-1} < g(x) \le y_\nu, \qquad (\nu = 1, 2, \ldots, n).$

Then $S = S_1 + \cdots + S_n$, and $S_\mu S_\nu = 0$ for $\mu \ne \nu$. Further, $S_\nu$ is the
difference between the two Borel sets defined by the inequalities
$g(x) \le y_\nu$ and $g(x) \le y_{\nu-1}$ respectively, so that $S_\nu$ is a Borel set. The
difference $M_\nu - m_\nu$ between the upper and lower bounds of $g(x)$ in $S_\nu$
is at most equal to $y_\nu - y_{\nu-1} < \varepsilon$. Hence we obtain for the Darboux
sums corresponding to this division of $S$

    $Z - z = \sum_{\nu=1}^{n} (M_\nu - m_\nu)\, L(S_\nu) \le \varepsilon \sum_{\nu=1}^{n} L(S_\nu) = \varepsilon L(S).$

But $\varepsilon$ is arbitrarily small, and thus by the preceding paragraph $g(x)$
is integrable over $S$.
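
The division used in this proof cuts up the range of $g(x)$ rather than the set $S$ itself. A numerical sketch in Python may make this concrete. It rests on assumptions made here for illustration: the function name `range_partition_sums` is invented, $S$ is the half-open interval $(0, 1]$, $g(x) = x^2$, and the points of division are $y_\nu = \nu/n$, so that $S_\nu$ is the interval from $\sqrt{y_{\nu-1}}$ to $\sqrt{y_\nu}$.

```python
import math

def range_partition_sums(n):
    """Darboux sums of g(x) = x**2 on S = (0, 1] for the division
    S_v = {x : y_{v-1} < g(x) <= y_v}, with y_v = v / n."""
    z = Z = 0.0
    for v in range(1, n + 1):
        y_lo, y_hi = (v - 1) / n, v / n
        measure = math.sqrt(y_hi) - math.sqrt(y_lo)  # L(S_v)
        z += y_lo * measure  # m_v = y_{v-1}
        Z += y_hi * measure  # M_v = y_v
    return z, Z

z, Z = range_partition_sums(1000)
```

With $n = 1000$ the difference $Z - z$ equals $L(S)/n = 0.001$ up to rounding, and the value $1/3$ of the integral lies between the two sums.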



The importance of the theorem thus proved follows from the fact that
all functions occurring in ordinary applications of mathematical analysis
are B-measurable. Accordingly, we shall in the sequel only consider
B-measurable functions. As in the case of the Borel sets, this will
generally not be explicitly mentioned, and should then always be tacitly
understood.

We shall here only indicate the main lines of the proof of the above statement,
referring for further detail to special treatises, e.g. de la Vallée Poussin (Ref. 40).
We first consider the case when the set $S$ is a finite or infinite interval $(a, b)$, and
write simply "B-measurable" instead of "B-measurable in $(a, b)$". If $g_1$ and $g_2$ are
B-measurable functions, the sum $g_1 + g_2$, the difference $g_1 - g_2$ and the product $g_1 g_2$
are also B-measurable. We shall give the proof for the case of the sum, the other
cases being proved in a similar way. Let $k$ be given, and let $U$ denote the set of
all $x$ in $(a, b)$ such that $g_1 + g_2 \le k$, while $U_r'$ and $U_r''$ denote the sets defined by the
inequalities $g_1 \le r$ and $g_2 \le k - r$ respectively. Then by hypothesis $U_r'$ and $U_r''$ are
Borel sets for any values of $k$ and $r$, and it will be verified without difficulty that
we have $U = \prod (U_r' + U_r'')$, where $r$ runs through the enumerable sequence of all
positive and negative rational numbers. Hence by 2.3 it follows that $U$ is a Borel
set for any value of $k$, and thus $g_1 + g_2$ is B-measurable. The extension to the
sum or product of a finite number of B-measurable functions is immediate.

Consider now an infinite sum $g = g_1 + g_2 + \cdots$ of B-measurable functions,
assumed to be convergent for any $x$ in $(a, b)$. Let $\varepsilon_1, \varepsilon_2, \ldots$ be a decreasing sequence
of positive numbers tending to zero, and let $Q_{mn}$ denote the set of all $x$ in $(a, b)$
such that $g_1 + \cdots + g_m \le k + \varepsilon_n$. Then $Q_{mn}$ is a Borel set, and if we put

    $R_{mn} = Q_{mn} Q_{m+1,n} \cdots, \qquad U_n = R_{1n} + R_{2n} + \cdots, \qquad U = U_1 U_2 \cdots,$

some reflection will show that $U$ is the set of all $x$ in $(a, b)$ such that $g(x) \le k$.
Since only sums and products of Borel sets have been used, $U$ is a Borel set, and
$g(x)$ is B-measurable. Further, if $g$ is the limit of a convergent sequence $g_1, g_2, \ldots$
of B-measurable functions, we may write $g = g_1 + (g_2 - g_1) + (g_3 - g_2) + \cdots$, and
thus $g$ is B-measurable.

Now it is evident that the function $g(x) = c x^n$ is B-measurable for any constant
$c$ and any non-negative integer $n$. It follows that any polynomial is B-measurable.
Any continuous function is the limit of a convergent sequence of polynomials, and is
thus B-measurable. Similarly all functions obtained by limit processes from continuous
functions are B-measurable.

By arguments of this type, our statement is proved for the case when $S$ is an
interval. If $g(x)$ is B-measurable in $(a, b)$, and $S$ is any Borel set in $(a, b)$, the
function $\varepsilon(x)$ equal to 1 in $S$, and to 0 in $S^*$, is evidently B-measurable in $(a, b)$. Then
the product $\varepsilon(x) g(x)$ is B-measurable in $(a, b)$, and this implies that $g(x)$ is
B-measurable in $S$. If, in particular, $S$ is the set of all $x$ in $(a, b)$ such that $g(x) \le 0$, we
have $|g(x)| = g(x) - 2\varepsilon(x) g(x)$. Thus the modulus of a B-measurable function is
itself B-measurable.

When we are dealing with B-measurable functions, all the ordinary
analytical operations and limit processes will thus lead to B-measurable
functions. By the theorem proved above, any bounded function
obtained in this way will be integrable in the Lebesgue sense over
any set of finite measure. For the Riemann integral, the corresponding
statement is not true,¹) and this is one of the properties that renders
the Lebesgue integral simpler than the Riemann integral.

We shall finally add a remark that will be used later (cf 14.5).
Let $g(x)$ be B-measurable in a set $S$. The equation $y = g(x)$ defines
a correspondence between the variables $x$ and $y$. Denote by $Y$ a given
set on the $y$-axis, and by $X$ the set of all $x$ in $S$ such that
$y = g(x)$ belongs to $Y$. We shall then say that the set $X$ corresponds to $Y$.
It is obvious that, if $Y$ is the sum, product or difference of certain
sets $Y_1, Y_2, \ldots$, then $X$ is the sum, product or difference of the
corresponding sets $X_1, X_2, \ldots$ Further, when $Y$ is a closed infinite
interval $(-\infty, k)$, we know that $X$ is a Borel set. Now any Borel set
may be formed from such intervals by addition, multiplication and
subtraction. It follows that the set $X$ corresponding to any Borel set $Y$
is a Borel set.


5.3. Properties of the integral. In this paragraph we consider
only bounded functions and sets of finite measure. The following
propositions (5.3.1)-(5.3.4) are perfectly analogous to the corresponding
propositions for the Riemann integral and are proved in the same
way as these, using the definitions given in 5.1:

(5.3.1)    $\int_S (g_1(x) + g_2(x))\,dx = \int_S g_1(x)\,dx + \int_S g_2(x)\,dx,$

(5.3.2)    $\int_S c\,g(x)\,dx = c \int_S g(x)\,dx,$

(5.3.3)    $m L(S) \le \int_S g(x)\,dx \le M L(S),$

(5.3.4)    $\int_{S_1 + S_2} g(x)\,dx = \int_{S_1} g(x)\,dx + \int_{S_2} g(x)\,dx,$


¹) Even if the limit $g(x)$ of a sequence of functions integrable in the Riemann
sense is bounded in an interval $(a, b)$, we cannot assert that the Riemann integral of
$g(x)$ over $(a, b)$ exists. Consider, e.g., the sequence $g_1, g_2, \ldots$, where $g_n$ is equal to
1 for all rational numbers $x$ with a denominator $< n$, and otherwise equal to 0.
Obviously $g_n$ is integrable in the Riemann sense over $(0, 1)$, but the limit of $g_n$ when
$n \to \infty$ is the function $g(x)$ equal to 1 or 0 according as $x$ is rational or irrational,
and we have seen in the preceding paragraph that the Riemann integral of this
function over $(0, 1)$ does not exist.


where $c$ is a constant, $m$ and $M$ denote the lower and upper bounds
of $g(x)$ in $S$, while $S_1$ and $S_2$ are two sets without common points.
(5.3.1) and (5.3.4) are immediately extended to an arbitrary finite
number of terms. If we consider the non-negative functions
$|g(x)| \pm g(x)$, it follows from (5.3.3) that we have

(5.3.5)    $\left| \int_S g(x)\,dx \right| \le \int_S |g(x)|\,dx.$

In the particular case when $g(x)$ is identically equal to 1, (5.3.3) gives

    $\int_S dx = L(S).$

It further follows from (5.3.3) that the integral of any bounded $g(x)$
over a set of measure zero is always equal to zero. By means of (5.3.4)
we then infer that, if $g_1(x)$ and $g_2(x)$ are equal for all $x$ in a set $S$,
except for certain values of $x$ forming a subset of measure zero, then

    $\int_S g_1(x)\,dx = \int_S g_2(x)\,dx.$

Thus if the values of the function to be integrated are arbitrarily
changed on a subset of measure zero, this has no influence on the
value of the integral. We may even allow the function to be
completely undetermined on a subset of measure zero. We also see
that, if two sets $S_1$ and $S_2$ differ by a set of measure zero, the
integrals of any bounded $g(x)$ over $S_1$ and $S_2$ are equal. Hence follows
in particular the truth of a statement made in 5.1, that the value of
an integral over an interval is the same whether the interval is closed,
open or half-open.

It follows from the above that in the theory of the Lebesgue
integral we may often neglect a set of measure zero. If a certain
condition is satisfied for all $x$ belonging to some set $S$ under
consideration, with the exception at most of certain values of $x$ forming
a subset of measure zero, we shall say that the condition is satisfied
almost everywhere in $S$ or for almost all values of $x$ belonging to $S$.

We shall now prove an important theorem due to Lebesgue
concerning the integral of the limit of a convergent sequence of
functions. We shall say that a sequence $g_1(x), g_2(x), \ldots$ is uniformly
bounded in the set $S$ if there is a constant $K$ such that $|g_\nu(x)| < K$
for all $\nu$ and for all $x$ in $S$.


If the sequence $\{g_\nu(x)\}$ is uniformly bounded in $S$, and if $\lim_{\nu \to \infty} g_\nu(x) = g(x)$
exists almost everywhere in $S$, we have

(5.3.6)    $\lim_{\nu \to \infty} \int_S g_\nu(x)\,dx = \int_S g(x)\,dx.$

If $\lim g_\nu(x)$ does not exist for all $x$ in $S$, we complete the
definition of $g(x)$ by putting $g(x) = 0$ for all $x$ such that the limit does
not exist. We then have $|g(x)| \le K$ for all $x$ in $S$, and it follows
from the preceding paragraph that $g(x)$ is B-measurable in $S$ and is
thus integrable over $S$. Let now $\varepsilon > 0$ be given, and consider the set
$S_n$ of all $x$ in $S$ such that $|g_\nu(x) - g(x)| \le \varepsilon$ for $\nu = n, n + 1, \ldots$.
Then $S_n$ is a Borel set, the sequence $S_1, S_2, \ldots$ is never decreasing,
and the limiting set $\lim S_n$ (cf 1.5) contains every $x$ in $S$ such that
$\lim g_\nu(x)$ exists. Thus by hypothesis $\lim S_n$ has the same measure as
$S$, and we have by (4.7.1)

    $\lim L(S_n) = L(\lim S_n) = L(S).$

We can thus choose $n$ such that $L(S_n) > L(S) - \varepsilon$, or $L(S - S_n) < \varepsilon$,
and then obtain for all $\nu \ge n$

    $\int_S |g_\nu(x) - g(x)|\,dx = \int_{S_n} + \int_{S - S_n} < \varepsilon\,[L(S) + 2K].$

Since $\varepsilon$ is arbitrary, and since

    $\left| \int_S g_\nu(x)\,dx - \int_S g(x)\,dx \right| \le \int_S |g_\nu(x) - g(x)|\,dx,$

this proves our theorem.
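
Two simple sequences, chosen here for illustration and not taken from the text, show what the theorem asserts and why the uniform bound is needed.

```latex
% Uniformly bounded case on S = (0, 1), closed: g_\nu(x) = x^\nu, |g_\nu(x)| < K = 2.
% The limit is 0 for 0 \le x < 1 and 1 for x = 1, hence 0 almost everywhere.
\lim_{\nu\to\infty} \int_0^1 x^\nu \, dx
  = \lim_{\nu\to\infty} \frac{1}{\nu + 1} = 0 = \int_0^1 g(x)\, dx .

% Not uniformly bounded: g_\nu(x) = \nu for 0 < x < 1/\nu, and g_\nu(x) = 0 otherwise.
% Here \lim g_\nu(x) = 0 for every x in (0, 1), but
\int_0^1 g_\nu(x)\, dx = \nu \cdot \frac{1}{\nu} = 1 \quad\text{for every } \nu ,
\qquad \int_0^1 \lim_{\nu\to\infty} g_\nu(x)\, dx = 0 .
```

In the second case no constant $K$ bounds all the $g_\nu$, and the relation (5.3.6) fails.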

The theorem (5.3.6) can be stated in another form as a theorem
on term-by-term integration of a series:

If the series $\sum_{\nu=1}^{\infty} f_\nu(x)$ converges almost everywhere in $S$, and if the
partial sums $\sum_{\nu=1}^{n} f_\nu(x)$ are uniformly bounded in $S$, then

(5.3.7)    $\int_S \Bigl( \sum_{\nu=1}^{\infty} f_\nu(x) \Bigr) dx = \sum_{\nu=1}^{\infty} \int_S f_\nu(x)\,dx.$

Under this form, the theorem appears as a generalization of (5.3.1)
to an infinite number of terms. We shall now show that a
corresponding generalization of (5.3.4) may be deduced as a corollary from
(5.3.7).

If $S = S_1 + S_2 + \cdots$, where $S_\mu S_\nu = 0$ for $\mu \ne \nu$, then

(5.3.8)    $\int_S g(x)\,dx = \sum_{\nu=1}^{\infty} \int_{S_\nu} g(x)\,dx.$

Let $e_\nu(x)$ denote a function equal to 1 for all x in $S_\nu$ and otherwise
equal to zero. For any x belonging to S, we then have

    $g(x) = \sum_{\nu=1}^{\infty} e_\nu(x)\,g(x),$

and it is obvious that the partial sums of this series are uniformly
bounded in S. Then (5.3.7) gives

    $\int_S g(x)\,dx = \sum_{\nu=1}^{\infty} \int_S e_\nu(x)\,g(x)\,dx = \sum_{\nu=1}^{\infty} \int_{S_\nu} g(x)\,dx.$

In the particular case g(x) = 1, (5.3.8) reduces to the additivity
relation (4.5.2) for Lebesgue measure.



5.4. The integral of an unbounded function over a set of finite
measure. In 5.1 and 5.2 we have seen that the Lebesgue integral

    $\int_S g(x)\,dx$

has a definite meaning under the two assumptions that 1) g(x) is
bounded in S, and 2) S is of finite measure. We shall now try to
remove these restrictions. In this paragraph, we consider the case
when S is still of finite measure, but g(x) is not necessarily bounded
in S.

Let a and b be any numbers such that a < b, and put

    $g_{a,b}(x) = \begin{cases} a & \text{if } g(x) < a, \\ g(x) & \text{if } a \le g(x) \le b, \\ b & \text{if } g(x) > b. \end{cases}$


Obviously $g_{a,b}(x)$ is bounded and B-measurable in S, and thus
integrable over S. If the limit

(5.4.1)    $\lim_{a \to -\infty,\; b \to +\infty} \int_S g_{a,b}(x)\,dx = \int_S g(x)\,dx$

exists and has a finite value, we shall say that g(x) is integrable over
S. This limit is then, by definition, the Lebesgue integral of g(x)
over S.
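
The definition (5.4.1) can be tried out numerically. The Python sketch
below uses the unbounded function 1/sqrt(x) on the interval (0, 1); this
function, the helper names `truncate` and `midpoint_integral`, and the
midpoint sum used in place of the Lebesgue integral are all assumptions
chosen for illustration. For this particular function the integral of
the truncated function is 2 - 1/b when b >= 1, so the limit in (5.4.1)
exists and equals 2.

```python
import math


def truncate(g, a, b):
    """Return the truncated function g_{a,b}: the values of g clipped to [a, b]."""
    return lambda x: min(max(g(x), a), b)


def midpoint_integral(func, lower, upper, pieces=200000):
    """Approximate the integral of func over (lower, upper) by a midpoint sum."""
    width = (upper - lower) / pieces
    return width * sum(func(lower + (j + 0.5) * width) for j in range(pieces))


def g(x):
    """Illustrative function, unbounded near x = 0."""
    return 1.0 / math.sqrt(x)


# g is positive, so the lower truncation level a = -b plays no role here.
truncated = {b: midpoint_integral(truncate(g, -b, b), 0.0, 1.0) for b in (1, 10, 100)}
for b, value in truncated.items():
    print(b, value, 2.0 - 1.0 / b)  # approximation next to the exact value
```

The midpoints never coincide with x = 0, so the function is only
evaluated where it is finite.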

It follows directly from the definition that any function is
integrable over a set of measure zero, and that the value of the integral
is zero, as in the case of a bounded function.

In the definition (5.4.1), we may assume a < 0, b > 0, and then have

    $g_{a,b}(x) = g_{a,b} = g_{a,0} + g_{0,b},$

    $|g(x)|_{a,b} = |g|_{a,b} = -g_{-b,0} + g_{0,b}.$

For fixed x, $g_{a,0}(x)$ and $g_{0,b}(x)$ are never decreasing functions of a
and b respectively. It follows that both g(x) and |g(x)| are integrable
if, and only if, the limits

(5.4.2)    $\lim_{a \to -\infty} \int_S g_{a,0}(x)\,dx$  and  $\lim_{b \to +\infty} \int_S g_{0,b}(x)\,dx$

are both finite. Hence the integrability of g(x) is equivalent with the
integrability of |g(x)|. It further follows that, if g(x) is integrable
over S, it is also integrable over any subset of S.

If, for all x in S, we have |g(x)| < G(x), where G(x) is integrable
over S, we have $|g|_{a,b} \le G_{a,b}$, so that |g(x)| and thus also g(x) are
integrable over S.

We now immediately find that the properties (5.3.2)-(5.3.5) of the
integral hold true for any integrable g(x). With respect to (5.3.3) it
should, of course, be observed that one of the bounds m and M, or
both, may be infinite.

We proceed to the generalization of (5.3.1), which is a little more
difficult. Suppose that f(x) and g(x) are both integrable over S. From

    $|f + g|_{a,0} = 0, \qquad |f + g|_{0,b} \le |f|_{0,b} + |g|_{0,b},$

it follows that f(x) + g(x) is also integrable. We have to show that
the property (5.3.1) holds in the present case, i. e. that

(5.4.3)    $\int_S (f + g)\,dx = \int_S f\,dx + \int_S g\,dx.$

Suppose in the first place that f and g are both non-negative in S.
Then

    $(f + g)_{a,0} = f_{a,0} = g_{a,0} = 0,$

    $(f + g)_{0,b} \le f_{0,b} + g_{0,b} \le (f + g)_{0,2b},$

and hence

    $\int_S (f + g)_{a,b}\,dx \le \int_S f_{a,b}\,dx + \int_S g_{a,b}\,dx \le \int_S (f + g)_{a,2b}\,dx.$

Allowing a and b to tend to their respective limits, we obtain (5.4.3).

Now S may be divided into at most six subsets, no two of which
have a common point, such that in each subset none of the three
functions f, g and f + g changes its sign. For each subset, (5.4.3) is
proved by the above argument. Adding the results and using (5.3.4)
we obtain (5.4.3) for the general case.

We have thus shown that all the properties (5.3.1)-(5.3.5) of the
integral hold true in the present case. In order to generalize also
the properties expressed by the relations (5.3.6)-(5.3.8), we shall first
prove the following lemma:

If g(x) is integrable over $S_0$, and if $\varepsilon > 0$ is given, we can always
find $\delta > 0$ such that

(5.4.4)    $\left| \int_S g(x)\,dx \right| < \varepsilon$

for every subset $S \subset S_0$ which satisfies the condition $L(S) < \delta$.

Since we have seen that (5.3.5) holds in the present case, it is
sufficient to prove the lemma for a non-negative function g(x). In
that case

    $\int_{S_0} g\,dx = \lim_{b \to \infty} \int_{S_0} g_{0,b}\,dx,$

and thus we can find b such that

    $0 \le \int_{S_0} (g - g_{0,b})\,dx < \tfrac{1}{2}\varepsilon.$

Since the integrand is non-negative, it follows by means of (5.3.4)
and (5.3.3) that we have for any subset $S \subset S_0$

    $\int_S (g - g_{0,b})\,dx < \tfrac{1}{2}\varepsilon$

or

    $\int_S g\,dx < \int_S g_{0,b}\,dx + \tfrac{1}{2}\varepsilon \le b\,L(S) + \tfrac{1}{2}\varepsilon.$

Choosing $\delta = \varepsilon / (2b)$, the truth of the lemma follows immediately.




A consequence of the lemma is that, if g(x) is integrable over an
interval (a, b), the integral $\int_a^x g(t)\,dt$ is a continuous function of x for
a < x < b.

We can now proceed to the generalization of (5.3.6). Assuming
that $\lim_{\nu\to\infty} g_\nu(x) = g(x)$ almost everywhere in S, we shall show that the
relation

(5.4.5)    $\lim_{\nu\to\infty} \int_S g_\nu(x)\,dx = \int_S g(x)\,dx$

holds if the sequence $\{g_\nu(x)\}$ is uniformly dominated by an integrable
function, i. e. if $|g_\nu(x)| < G(x)$ for all $\nu$ and for all x in S, where G(x)
is integrable over S. In the particular case G(x) = const., this
reduces to (5.3.6).

The proof is quite similar to the proof of (5.3.6). We first observe
that it follows from the hypothesis that $|g(x)| \le G(x)$ almost
everywhere in S; thus $g_\nu(x)$ and g(x) are integrable over S. Given $\varepsilon > 0$,
we then denote by $S_n$ the set of all x in S such that $|g_\nu(x) - g(x)| \le \varepsilon$
for all $\nu \ge n$. Then $S_1, S_2, \ldots$ is a never decreasing sequence, and
$L(S_n) \to L(S)$. Using lemma (5.4.4), we now determine $\delta$ such that
$\int_{S'} G(x)\,dx < \varepsilon$ for every $S' \subset S$ with $L(S') < \delta$, and then choose n such
that $L(S_n) > L(S) - \delta$, and consequently $L(S - S_n) < \delta$. We then
obtain for all $\nu \ge n$

    $\int_S |g_\nu(x) - g(x)|\,dx = \int_{S_n} + \int_{S - S_n} \le \varepsilon L(S) + 2 \int_{S - S_n} G(x)\,dx < \varepsilon\,[L(S) + 2],$

and thus (5.4.5) is proved. The corresponding generalization of
(5.3.7) and (5.3.8) is immediate.

5.5. The integral over a set of infinite measure. We shall now
remove also the second restriction mentioned at the beginning of 5.4,
and consider Lebesgue integrals over sets of infinite measure. Let S
be a Borel set of infinite measure, and denote by $S_{a,b}$ the product
(common part) of S with the closed interval (a, b), where a and b are
finite. Then $S_{a,b}$ is, of course, of finite measure.

If g(x) is integrable over $S_{a,b}$ for all a and b, and if the limit

    $\lim_{a \to -\infty,\; b \to +\infty} \int_{S_{a,b}} |g(x)|\,dx = \int_S |g(x)|\,dx$

exists and has a finite value, we shall say that g(x) is integrable
over S. 1) It is easily seen that in this case the limit

(5.5.1)    $\lim_{a \to -\infty,\; b \to +\infty} \int_{S_{a,b}} g(x)\,dx = \int_S g(x)\,dx$

also exists and has a finite value, and we shall accordingly say that
the Lebesgue integral of g(x) over the set S is convergent 1). The limit
(5.5.1) is then, by definition, the value of this integral. If g(x) is
integrable over S, it is also integrable over any subset of S.

If |g(x)| < G(x) for all x in S, where G(x) is integrable over S,
it is easily seen that g(x) is integrable over S. Since
$|f + g| \le |f| + |g|$, it follows that the sum of two integrable functions is
itself integrable.

It follows directly from the definition that the properties (5.3.1),
(5.3.2) and (5.3.4) hold true in the case of functions integrable over a
set of infinite measure. Instead of (5.3.3), we obtain here only the
inequality

    $\int_S g(x)\,dx \ge 0$  if  $g(x) \ge 0$ for all x in S.

This is, however, sufficient for the deduction of (5.3.5) for any
integrable g(x).

We now proceed to the generalization of (5.4.5), which is itself a
generalization of (5.3.6). If $\lim g_\nu(x) = g(x)$ almost everywhere in S,
and if $|g_\nu| < G$, where G is integrable over S, it follows as in the
preceding paragraph that $|g| \le G$ almost everywhere in S.
Consequently g(x) is integrable over S, and we can choose a and b such
that for all $\nu$

    $\int_{S - S_{a,b}} |g_\nu - g|\,dx \le 2 \int_{S - S_{a,b}} G(x)\,dx < \tfrac{1}{2}\varepsilon.$

Now $S_{a,b}$ is of finite measure, and it then follows from the proof of
(5.4.5) that we can choose n such that for all $\nu \ge n$

    $\int_{S_{a,b}} |g_\nu - g|\,dx < \tfrac{1}{2}\varepsilon.$

1) Strictly speaking, we ought to say that g(x) is absolutely integrable over S,
and that the integral of g(x) over S is absolutely convergent. As we shall only in
exceptional cases use non-absolutely convergent integrals we may, however, without
inconvenience use the simpler terminology adopted in the text.




We then have for $\nu \ge n$

    $\int_S |g_\nu - g|\,dx = \int_{S_{a,b}} + \int_{S - S_{a,b}} < \varepsilon.$

Since $\varepsilon$ is arbitrary, we have thus proved the following theorem, which
contains (5.3.6) and (5.4.5) as particular cases:

If $\lim_{\nu\to\infty} g_\nu(x) = g(x)$ exists almost everywhere in the set S of finite or
infinite measure, and if $|g_\nu(x)| < G(x)$ for all $\nu$ and for all x in S,
where G(x) is integrable over S, then g(x) is integrable over S, and

(5.5.2)    $\lim_{\nu\to\infty} \int_S g_\nu(x)\,dx = \int_S g(x)\,dx.$

The theorem (5.5.2) may, of course, also be stated as a theorem on
term-by-term integration of series analogous to (5.3.7). Finally, the
argument used for the proof of (5.3.8) evidently applies in the present
case and leads to the following generalized form of that theorem:

If g(x) is integrable over S, and if $S = S_1 + S_2 + \cdots$, where $S_\mu S_\nu = 0$
for $\mu \ne \nu$, then

(5.5.3)    $\int_S g(x)\,dx = \sum_{\nu=1}^{\infty} \int_{S_\nu} g(x)\,dx.$

5.6. The Lebesgue integral as an additive set function. Let us
consider a fixed non-negative function f(x), integrable over any finite
interval, and put for any Borel set S

(5.6.1)    $P(S) = \begin{cases} \int_S f(x)\,dx, & \text{if } f(x) \text{ is integrable over } S, \\ +\infty & \text{otherwise.} \end{cases}$

Then P(S) is a non-negative function of the set S, uniquely defined
for all Borel sets S. Let now $S = S_1 + S_2 + \cdots$, where $S_\mu S_\nu = 0$ for
$\mu \ne \nu$. It then follows from (5.5.3) that the additivity relation

    $P(S) = P(S_1) + P(S_2) + \cdots$

holds as soon as P(S) is finite. The same relation holds, however,
even if P(S) is infinite. For if this were not true, it would be
possible to choose the sets S and $S_1, S_2, \ldots$ such that $P(S) = +\infty$, while
the sum $P(S_1) + P(S_2) + \cdots$ would be finite. This would, however,
imply the relation

    $\int_{S_{a,b}} f(x)\,dx = \sum_{\nu=1}^{\infty} \int_{(S_\nu)_{a,b}} f(x)\,dx \le \sum_{\nu=1}^{\infty} P(S_\nu).$

Allowing here a and b to tend to their respective limits, it follows
that f(x) would be integrable over S, against our hypothesis. Thus
P(S) as defined by (5.6.1) is a non-negative and additive set function,
defined for all Borel sets S in $R_1$.

In the particular case when f(x) = 1, we have P(S) = L(S), so that
P(S) is identical with the Lebesgue measure of the set S. Another
important particular case arises when f(x) is integrable over the whole
space $R_1$. In this case, P(S) is always finite, and we have for any
Borel set S

    $P(S) \le \int_{-\infty}^{\infty} f(x)\,dx.$

CHAPTER 6. 

Non-Negative Additive Set Functions in $R_1$.

6.1. Generalization of the Lebesgue measure and the Lebesgue 
integral. — In Ch. 4 we have determined the Lebesgue measure L (S) 
for any Borel set S. L(S) is a number associated with S or, as we 
have expressed it, a function of the set S. We have seen that this set 
function satisfies the three conditions of 4.2, which require that L (S) 
should be a) non-negative, b) additive, and c) for any interval equal 
to the length of the interval. We have finally seen that L(S) is the 
only set function satisfying the three conditions. 

On the other hand, if we omit the condition c), L(S) will no longer 
be the only set function satisfying our conditions. Thus e. g. the func¬ 
tion P(S) defined by (5.6.1) satisfies the conditions a) and b), while c) 
is only satisfied in the particular case f(x) = 1, when P(S) = L{S). — 
Another example is obtained in the following way. Let $x_1, x_2, \ldots$ be
a sequence of points, and $p_1, p_2, \ldots$ a sequence of positive quantities.
Then let us put for any Borel set S

    $P(S) = \sum_{x_\nu \in S} p_\nu,$

the sum being extended to all $x_\nu$ belonging to S. It is readily seen
that the set function P(S) thus defined satisfies the conditions a) and
b), but not c).
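
The second example can be made concrete with a few lines of Python. In
the sketch below the three mass points, their weights, and the
representation of a set S by a membership predicate are all hypothetical
choices made for illustration; exact fractions are used so that the
additivity check involves no rounding.

```python
from fractions import Fraction

# Hypothetical mass points x_nu with positive weights p_nu (illustrative values).
points = [Fraction(0), Fraction(1, 2), Fraction(2)]
weights = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]


def P(contains):
    """Sum the weights p_nu of all points x_nu belonging to the set S.

    The set S is represented by a predicate contains(x) -> bool.
    """
    return sum((p for x, p in zip(points, weights) if contains(x)), Fraction(0))


def left(x):
    """Membership in the interval 0 <= x < 1."""
    return 0 <= x < 1


def right(x):
    """Membership in the interval 1 <= x <= 3."""
    return 1 <= x <= 3


def whole(x):
    """Membership in the sum of the two disjoint intervals above."""
    return left(x) or right(x)


def single(x):
    """Membership in the set consisting of the single point 1/2."""
    return x == Fraction(1, 2)


print(P(left), P(right), P(whole), P(single))
```

Here P is non-negative and adds up over the two disjoint intervals,
while the interval 0 <= x < 1 of length 1 receives the value 3/4 and the
single point 1/2 of length 0 receives the value 1/2, so that condition c)
fails.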

We are thus led to the general concept of a non-negative and
additive set function, as a natural generalization of the Lebesgue measure
L(S). In the present chapter we shall first, in the paragraphs 6.2-6.4,
investigate some general properties of functions of this type.

In the applications to probability theory and statistics, that will be
made later in this book, a fundamental part is played by a particular
class of non-negative and additive set functions. This class will be
considered in the paragraphs 6.5-6.8.

In the following Chapter 7, we shall then proceed to show that
the whole theory of the Lebesgue integral may be generalized by
replacing, in the basic definition (5.1.1) of the Darboux sums, the Lebesgue
measure L(S) by a general non-negative and additive set function P(S).
The generalized integral obtained in this way, which is known as the
Lebesgue-Stieltjes integral, will also be of a fundamental importance
for the applications.

6.2. Set functions and point functions. We shall consider a set
function P(S) defined for all Borel sets S and satisfying the following
three conditions:

A) P(S) is non-negative: $P(S) \ge 0$.

B) P(S) is additive:

    $P(S_1 + S_2 + \cdots) = P(S_1) + P(S_2) + \cdots$    ($S_\mu S_\nu = 0$ for $\mu \ne \nu$).

C) P(S) is finite for any bounded set S.

All set functions considered in the sequel will be assumed to satisfy these
conditions.

From the conditions A) and B), which are the same as in the
particular case of the Lebesgue measure L(S), we directly obtain certain
properties of P(S), which are proved in the same way as the
corresponding properties of L(S). Thus if $S_1 \subset S_2$, we have

(6.2.1)    $P(S_1) \le P(S_2).$

For the empty set we have P(0) = 0. If $S_1, S_2, \ldots$ are sets which
may or may not have common points, we have (cf 4.3.7, which
obviously holds for any Borel sets)

(6.2.2)    $P(S_1 + S_2 + \cdots) \le P(S_1) + P(S_2) + \cdots.$

For a non-decreasing sequence $S_1, S_2, \ldots$, we have (cf 4.7.1)

(6.2.3)    $\lim P(S_n) = P(\lim S_n).$

For a non-increasing sequence, the same relation holds provided that
$P(S_1)$ is finite.

When a set S consists of all points $\xi$ that satisfy a certain
relation, we shall often denote the value P(S) simply by replacing the
sign S within the brackets by the relation in question. Thus e. g. if
S is the closed interval (a, b), we shall write

    $P(S) = P(a \le \xi \le b).$

When S is the set consisting of the single point $\xi = a$, we shall write

    $P(S) = P(\xi = a),$

and similarly in other cases.

We have called P(S) a set function, since the argument of this
function is a set. For an ordinary function $F(x_1, \ldots, x_n)$ of one or
more variables, the argument may be considered a point with the
coordinates $x_1, \ldots, x_n$, and we shall accordingly often refer to such a
function as a point function. When a set function P(S) and a
constant k are given, we define a corresponding point function F(x; k)
by putting

(6.2.4)    $F(x; k) = \begin{cases} P(k < \xi \le x) & \text{for } x > k, \\ 0 & \text{for } x = k, \\ -P(x < \xi \le k) & \text{for } x < k. \end{cases}$

Whatever the value of the constant parameter k, we then find for any
finite interval (a, b)

    $F(b; k) - F(a; k) = P(a < \xi \le b) \ge 0,$

which shows that F(x; k) is a non-decreasing function of x. If in the
last relation we allow a to tend to $-\infty$, or b to tend to $+\infty$, or
both, it follows from (6.2.3) that the same relation holds also for
infinite intervals. In the particular case when P(S) is the Lebesgue
measure L(S), we have F(x; k) = x - k.

The functions F(x; k) corresponding to two different values of the
parameter k differ by a quantity independent of x. In fact, if $k_1 < k_2$
we obtain

    $F(x; k_1) - F(x; k_2) = P(k_1 < \xi \le k_2).$

Thus if we choose an arbitrary value $k_0$ of k and denote the
corresponding function $F(x; k_0)$ simply by F(x), any other F(x; k) will be
of the form F(x) + const.
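
The two relations just derived can be checked on a small example. The
Python sketch below assumes, purely for illustration, a set function
equal to the Lebesgue measure plus a mass 1/2 placed in the point 1; the
names `ATOMS`, `P_half_open` and `F` are hypothetical, and only
half-open intervals are evaluated.

```python
from fractions import Fraction

# Illustrative set function: Lebesgue length plus a mass 1/2 in the point 1.
ATOMS = {Fraction(1): Fraction(1, 2)}


def P_half_open(a, b):
    """P(a < xi <= b) for a <= b: the length b - a plus the masses in (a, b]."""
    return (b - a) + sum(p for x, p in ATOMS.items() if a < x <= b)


def F(x, k):
    """The point function F(x; k) defined as in (6.2.4)."""
    if x > k:
        return P_half_open(k, x)
    if x < k:
        return -P_half_open(x, k)
    return Fraction(0)


k1, k2 = Fraction(0), Fraction(3)
xs = [Fraction(-2), Fraction(1, 2), Fraction(1), Fraction(5)]

# The increase of F(x; k) over (a, b) equals P(a < xi <= b), whatever k is.
print(F(Fraction(2), k1) - F(Fraction(-1), k1), P_half_open(Fraction(-1), Fraction(2)))

# F(x; k1) - F(x; k2) does not depend on x.
print([F(x, k1) - F(x, k2) for x in xs], P_half_open(k1, k2))
```

Every entry of the printed list coincides with the value of
P(k1 < xi <= k2), in agreement with the statement that two such
functions differ by a constant.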

We may thus say that to any set function P(S) satisfying the
conditions A)-C), there corresponds a non-decreasing point function F(x) such
that for any finite or infinite interval (a, b) we have

(6.2.5)    $F(b) - F(a) = P(a < \xi \le b).$

F(x) is uniquely determined except for an additive constant.

We now choose an arbitrary, but fixed value of the parameter k,
and consider the corresponding function F(x). Since F(x) is
non-decreasing, the two limits from above and from below

    $F(a + 0) = \lim_{x \to a+0} F(x), \qquad F(a - 0) = \lim_{x \to a-0} F(x)$

exist for all values of a, and $F(a - 0) \le F(a + 0)$. According to
(6.2.5) we have for x > a

    $F(x) - F(a) = P(a < \xi \le x).$

Consider this relation for a decreasing sequence of values of x tending
to the fixed value a. The corresponding half-open intervals $a < \xi \le x$
form a decreasing sequence of sets, the limiting set of which is empty.
Thus by (6.2.3) we have $F(x) - F(a) \to 0$, i. e.

    $F(a + 0) = F(a).$

On the other hand, for x < a

    $F(a) - F(x) = P(x < \xi \le a),$

and a similar argument shows that

    $F(a - 0) = F(a) - P(\xi = a) \le F(a).$

Thus the function F(x) is always continuous to the right. For every
value of x such that $P(\xi = x) > 0$, F(x) has a discontinuity with the
saltus $P(\xi = x)$. For every value of x such that $P(\xi = x) = 0$, F(x) is
continuous.

Any x such that P(S) takes a positive value for the set S
consisting of the single point x, is thus a discontinuity point of F(x).
These points are called discontinuity points also for the set function
P(S), and any continuity point of F(x) is also called a continuity
point of P(S).

The discontinuity points of P(S) and F(x) form at most an
enumerable set. Consider, in fact, the discontinuity points x belonging to
the interval $i_n$ defined by $n < x \le n + 1$, and such that $P(\xi = x) > 1/c$.
Let $S_\nu$ be a set consisting of any $\nu$ of these points, say $x_1, \ldots, x_\nu$.
Since $S_\nu$ is a subset of the interval $i_n$, we then obtain

    $P(i_n) \ge P(S_\nu) = P(\xi = x_1) + \cdots + P(\xi = x_\nu) > \nu / c,$

or $\nu < c\,P(i_n)$. Thus there can at most be a finite number of points
x, and if we allow c to assume the values c = 1, 2, ..., we find
that the discontinuity points in $i_n$ form at most an enumerable set.
Summing over $n = 0, \pm 1, \pm 2, \ldots$, we obtain (cf 1.4) the proposition
stated.

Let now $x_1, x_2, \ldots$ be all discontinuity points of P(S) and F(x),
let X denote the set of all the points $x_\nu$, and put $P(\xi = x_\nu) = p_\nu$.
For any set S, the product set SX consists of all the points $x_\nu$
belonging to S, while the set S - SX = SX* contains all the remaining
points of S. We now define two new set functions $P_1$ and $P_2$ by writing

(6.2.6)    $P_1(S) = P(SX), \qquad P_2(S) = P(SX^*).$

It is then immediately seen that $P_1$ and $P_2$ both satisfy our conditions
A)-C). Further, we have S = SX + SX*, and hence

(6.2.7)    $P(S) = P_1(S) + P_2(S).$

It follows from (6.2.6) that $P_1(S)$ is the sum of the saltuses $p_\nu$ for
all discontinuities $x_\nu$ belonging to S. Thus $P_1(S) = 0$ for a set S
which does not contain any $x_\nu$. On the other hand, (6.2.6) shows that
$P_2(S)$ is everywhere continuous, since all points belonging to X* are
continuity points of P(S). Thus (6.2.7) gives a decomposition of the
non-negative and additive set function P(S) in a discontinuous part
$P_1(S)$ and a continuous part $P_2(S)$.

If F, $F_1$ and $F_2$ are the non-decreasing point functions
corresponding to P, $P_1$ and $P_2$, and if we choose the same value of the
additive constant k in all three cases, we obtain from (6.2.4) and (6.2.7)

(6.2.8)    $F(x) = F_1(x) + F_2(x).$

Here, $F_2$ is everywhere continuous, while $F_1$ is a »step-function»,
which is constant over every interval free from the points $x_\nu$, but has
a »step» of the height $p_\nu$ in every $x_\nu$. It is easily seen that any
non-decreasing function F(x) may be represented in the form (6.2.8),
as the sum of a step-function and an everywhere continuous function,
both non-decreasing and uniquely determined.
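
A small Python sketch may help to visualize the decomposition (6.2.8).
The function F below, with a continuous part rising from 0 to 1/2 on
(0, 1) and two jumps of height 1/4, is a hypothetical example, and its
saltuses are supplied by hand in the table `SALTUS` rather than
computed; the step part is normalized so as to vanish to the left of all
jumps, which is one admissible choice of the additive constant.

```python
from fractions import Fraction


def F(x):
    """Illustrative non-decreasing, right-continuous function with two jumps."""
    continuous = min(max(x, Fraction(0)), Fraction(1)) / 2
    jump_half = Fraction(1, 4) if x >= Fraction(1, 2) else Fraction(0)
    jump_one = Fraction(1, 4) if x >= 1 else Fraction(0)
    return continuous + jump_half + jump_one


# Discontinuity points x_nu with their saltuses p_nu = F(x_nu) - F(x_nu - 0).
SALTUS = {Fraction(1, 2): Fraction(1, 4), Fraction(1): Fraction(1, 4)}


def F1(x):
    """Step-function part: the sum of the saltuses p_nu at all x_nu <= x."""
    return sum((p for point, p in SALTUS.items() if point <= x), Fraction(0))


def F2(x):
    """Continuous part, obtained as the remainder F - F1."""
    return F(x) - F1(x)


h = Fraction(1, 10**6)
for point in SALTUS:
    # The increase of F over (point - h, point] keeps the saltus,
    # while the increase of F2 over the same interval shrinks with h.
    print(point, F(point) - F(point - h), F2(point) - F2(point - h))
```

For each jump point the first printed difference stays above the saltus
1/4 however small h is taken, while the second is of the order of h.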

6.3. Construction of a set function. We shall now prove the
following converse of theorem (6.2.5):

To any non-decreasing point function F(x), that is finite for all finite
x and is always continuous to the right, there corresponds a set function
P(S), uniquely determined for all Borel sets S and satisfying the
conditions A)-C) of 6.2, in such a way that the relation

    $F(b) - F(a) = P(a < \xi \le b)$

holds for any finite or infinite interval (a, b). It is then evident that
two functions $F_1(x)$ and $F_2(x)$ yield the same P(S) if and only if the
difference $F_1 - F_2$ is constant.

Comparing this with theorem (6.2.5) we find that, if two functions
$F_1$ and $F_2$ differing by a constant are counted as identical, there is
a one-to-one correspondence between the set functions P(S) and the
non-decreasing point functions F(x).

In the first place, the non-decreasing point function F(x)
determines a non-negative interval function P(i), which may be defined as
the increase of F(x) over the interval i. For any half-open interval
defined by $a < x \le b$, P(i) assumes the value $P(a < x \le b) = F(b) - F(a)$.
For the three other types of intervals with the same end-points a and
b we determine the value of P(i) by a simple limit process and thus
obtain

(6.3.1)
    $P(a \le x \le b) = F(b) - F(a - 0),$
    $P(a < x < b) = F(b - 0) - F(a),$
    $P(a < x \le b) = F(b) - F(a),$
    $P(a \le x < b) = F(b - 0) - F(a - 0),$

so that P(i) is completely determined for any interval i.
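
The four relations (6.3.1) translate directly into code once F(x) and
its limit from below are available. The Python sketch below assumes a
hypothetical step function with jumps 1/3 and 2/3 in the points 0 and 1;
the names `ATOMS`, `F_left` and `P_interval` are illustrative, and the
limit from below is written out explicitly rather than computed by a
limit process.

```python
from fractions import Fraction

# Illustrative jumps: mass 1/3 in the point 0 and mass 2/3 in the point 1.
ATOMS = {Fraction(0): Fraction(1, 3), Fraction(1): Fraction(2, 3)}


def F(x):
    """Right-continuous step function: the total mass in points <= x."""
    return sum((p for point, p in ATOMS.items() if point <= x), Fraction(0))


def F_left(x):
    """The limit from below F(x - 0): the total mass in points < x."""
    return sum((p for point, p in ATOMS.items() if point < x), Fraction(0))


def P_interval(a, b, closed_left, closed_right):
    """Interval function P(i) as in (6.3.1); assumes a < b, or a = b for a closed interval."""
    upper = F(b) if closed_right else F_left(b)
    lower = F_left(a) if closed_left else F(a)
    return upper - lower


zero, one = Fraction(0), Fraction(1)
print(P_interval(zero, one, True, True))    # closed interval
print(P_interval(zero, one, False, False))  # open interval
print(P_interval(zero, one, False, True))   # half-open, closed to the right
print(P_interval(zero, one, True, False))   # half-open, closed to the left
print(P_interval(one, one, True, True))     # degenerate interval: the point 1
```

The values differ according to which end-points are included, since
both end-points carry a positive mass in this example.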

The theorem to be proved asserts that it is possible to find a
non-negative and additive set function, defined for all Borel sets S, and
equal to P(i) in the particular case when S is an interval i.



This is, however, a straightforward generalization of the problem
treated in Ch. 4. In that chapter, we have been concerned with the
particular case F(x) = x, and with the corresponding interval function:
the length L(i) of an interval i. The whole theory of Lebesgue
measure as developed in Ch. 4 consists in the construction of a
non-negative and additive set function, defined for all Borel sets S and
equal to L(i) in the particular case when S is an interval i. It is
now required to perform the analogous construction in the case when
the length or »L-measure» of an interval, L(i) = b - a, has been replaced
by the more general »P-measure» P(i) defined by (6.3.1).

Now this may be done by exactly the same method as we have
applied to the particular case treated in Ch. 4. With two minor
exceptions to be discussed below, every word and every formula of Ch.
4 will hold good, if 1) the words measure and measurable are throughout
replaced by P-measure and P-measurable, 2) the length L(i) = b - a
of an interval is replaced by the P-measure P(i), and 3) the signs L
and $\mathfrak{L}$ are everywhere replaced by P and $\mathfrak{P}$. In this way, strictly following
the model set out in 4.1-4.5, we establish the existence of a non-negative
and additive set function P(S), uniquely defined for a certain class $\mathfrak{P}$ of
sets that are called P-measurable, and equal to P(i) when S is an
interval i. Further, it is shown exactly as in 4.6 that the class $\mathfrak{P}$ of all
P-measurable sets is an additive class and thus contains all Borel sets.
Finally, we prove in the same way as in 4.7 that P(S) is the only
non-negative and additive set function defined for all Borel sets,
which reduces to the interval function P(i) when S is an interval.

In this way, our theorem is proved. Moreover, the proof explains
why it will be advantageous to restrict ourselves throughout to the
consideration of Borel sets. We find, in fact, that although the class
$\mathfrak{P}$ of all P-measurable sets may depend on the particular function F(x)
which forms our starting point, it always contains the whole class $\mathfrak{B}_1$
of Borel sets. Thus any Borel set is always P-measurable, and the
set function P(S) corresponding to any given F(x) can always be
defined for all Borel sets.

It now only remains to consider the two exceptional points in
Ch. 4 referred to above. The first point is very simple, and is not
directly concerned with the proof of the above theorem. In 4.3 we
have proved that the Lebesgue measure of an enumerable set is always
equal to zero. This follows from the fact that an enumerable set may
be considered as the sum of a sequence of degenerate intervals, each
of which has the length zero. The corresponding proposition for
P-measure is obviously false, as soon as the function F(x) has at least
one discontinuity point. A degenerate interval consisting of the single
point a may then well have a positive P-measure, since the first
relation (6.3.1) gives

    $P(x = a) = F(a) - F(a - 0).$

As soon as an enumerable set contains at least one discontinuity point
of F(x), it has thus a positive P-measure.

The second exceptional point arises in connection with the
generalization of paragraph 4.1, where we have proved that the length is
an additive interval function. In order to prove the same proposition
for P-measure, we have to show that

(6.3.2)    $P(i) = P(i_1) + P(i_2) + \cdots,$

where i and $i_1, i_2, \ldots$ are intervals such that $i = i_1 + i_2 + \cdots$ and
$i_\mu i_\nu = 0$ for $\mu \ne \nu$.

For a continuous F(x) this is shown by Borel's lemma exactly in
the same way as in the case of the corresponding relation (4.1.1),
replacing throughout length by P-measure. Let us, however, note that
in the course of the proof of (4.1.1) we have considered certain
intervals, e. g. the interval $(a - \varepsilon, a + \varepsilon)$ which is chosen so as to make
its length equal to $2\varepsilon$. When generalizing this proof to P-measure,
we should replace this interval by (a - h, a + h), choosing h such
that the P-measure F(a + h) - F(a - h) becomes equal to $2\varepsilon$.

On the other hand, if F(x) is a step-function possessing in i the
discontinuity points $x_1, x_2, \ldots$ with the respective steps $p_1, p_2, \ldots$, we
have

    $P(i) = \sum_{\nu} p_\nu, \qquad P(i_n) = \sum_{x_\nu \in i_n} p_\nu.$

Since no two of the $i_n$ have a common point, every $x_\nu$ belongs to
exactly one $i_n$, and it then follows from the properties of convergent
double series that (6.3.2) is satisfied.

Finally, by the remark made in connection with (6.2.8) any F(x)
is the sum of a step-function $F_1$ and a continuous component $F_2$, both
non-decreasing. For both these functions, (6.3.2) holds, and thus the
same relation also holds for their sum F(x). We have thus dealt
with the two exceptional points arising in the course of the
generalization of Ch. 4 to an arbitrary P-measure, and the proof of our
theorem is hereby completed.




6.4. P-measure. A set function P(S) satisfying the conditions
A)-C) of 6.2 defines a P-measure of the set S, which constitutes a
generalization of the Lebesgue measure L(S). Like the latter, the
P-measure is non-negative and additive.

By the preceding paragraph, the P-measure is uniquely determined
for any Borel set S, if the corresponding non-decreasing point
function F(x) is known. Since, by 6.2, F(x) is always continuous to the
right, it is sufficient to know F(x) in all its points of continuity.

If, for a set S, we have P(S) = 0, we shall say that S is a set of
P-measure zero. By (6.2.1), any subset of S is then also of P-measure
zero. The sum of a sequence of sets of P-measure zero is, by (6.2.2),
itself of P-measure zero. If F(a) = F(b), the half-open interval
$a < x \le b$ is of P-measure zero.

When a certain condition is satisfied for all points belonging to
some set S under consideration, except possibly certain points forming
a subset of P-measure zero, we shall say (cf 5.3) that the condition
is satisfied almost everywhere (P) or for almost all (P) points in the
set S.


6.5. Bounded set functions. For any Borel set S we have by
(6.2.1) $P(S) \le P(R_1)$. If $P(R_1)$ is finite, we shall say that the set
function P(S) is bounded. When P(S) is bounded, we shall always fix
the additive constant in the corresponding non-decreasing point
function F(x) by taking $k = -\infty$ in (6.2.4), so that we have for all values
of x

(6.5.1)    $F(x) = P(\xi \le x).$

When x tends to $-\infty$ in this relation, the set of all points $\xi \le x$
tends to a limit (cf 1.5), which is the empty set. Thus by (6.2.3) we
have $F(-\infty) = 0$. On the other hand, when $x \to +\infty$, the set $\xi \le x$
tends to the whole space $R_1$, and (6.2.3) now gives $F(+\infty) = P(R_1)$.
Since F(x) is non-decreasing, we thus have for all x

(6.5.2)    $0 \le F(x) \le P(R_1).$

6.6. Distributions. Non-negative and additive set functions P(S)
such that $P(R_1) = 1$ play a fundamental part in the applications to
mathematical probability and statistics. A function P(S) belonging
to this class is obviously bounded, and the corresponding non-decreasing
point function F(x) is defined by (6.5.1), so that

(6.6.1)
    $F(x) = P(\xi \le x),$
    $0 \le F(x) \le 1,$
    $F(-\infty) = 0, \qquad F(+\infty) = 1.$

A pair of functions P(S) and F(x) of this type will often be
concretely interpreted by means of a distribution of mass over the
one-dimensional space $R_1$. Let us imagine a unit of mass distributed over
$R_1$ in such a way that for every x the quantity of mass allotted to
the infinite interval $\xi \le x$ is equal to F(x). The construction of a set
function P(S) by means of a given point function F(x), as explained
in 6.3, may then be interpreted by saying that any Borel set S will
carry a determined mass quantity P(S). The total quantity of mass
on the whole line is $P(R_1) = 1$.

We are at liberty to define such a distribution either by the set
function P(S) or by the corresponding point function F(x). Using
a terminology adapted to the applications of these concepts that will
be made in the sequel, we shall call P(S) the probability function of
the distribution, while F(x) will be called the distribution function.

Thus a distribution function is a non-decreasing point function
F(x) which is everywhere continuous to the right and is such that
$F(-\infty) = 0$ and $F(+\infty) = 1$. Conversely, it follows from 6.3 that
any given F(x) with these properties determines a unique distribution,
having F(x) for its distribution function.

If $x_0$ is a discontinuity point of F(x), with a saltus equal to $p_0$,
the mass $p_0$ will be concentrated in the point $x_0$, which is then called
a discrete mass point of the distribution. On the other hand, if $x_0$
is a continuity point, the quantity of mass situated in the interval
$(x_0 - h, x_0 + h)$ will tend to zero with h.

The ratio $\dfrac{F(x + h) - F(x - h)}{2h}$ is the mean density of the mass
belonging to the interval $x - h < \xi \le x + h$. If the derivative F'(x) = f(x)
exists, the mean density tends to f(x) as h tends to zero, and
accordingly f(x) represents the density of mass at the point x. In the
applications to probability theory, f(x) will be called the probability
density or the frequency function of the distribution. Any frequency
function f(x) is non-negative and has the integral 1 over $(-\infty, \infty)$.
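
The passage from the mean density to the density can be followed
numerically. The Python sketch below assumes, as an illustration only,
the distribution function 1 - exp(-x) for x > 0 (and zero otherwise),
whose derivative exp(-x) is known in closed form; the function name
`mean_density` is hypothetical.

```python
import math


def F(x):
    """Illustrative distribution function: 1 - exp(-x) for x > 0, else 0."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0


def f(x):
    """Derivative of F where it exists (x != 0): the frequency function."""
    return math.exp(-x) if x > 0 else 0.0


def mean_density(x, h):
    """Mass in the interval x - h < xi <= x + h divided by its length 2h."""
    return (F(x + h) - F(x - h)) / (2.0 * h)


steps = (0.5, 0.1, 0.01, 0.001)
errors = [abs(mean_density(1.0, h) - f(1.0)) for h in steps]
for h, err in zip(steps, errors):
    print(h, mean_density(1.0, h), f(1.0), err)
```

At the point x = 1 the difference between the mean density and f(1)
shrinks roughly like the square of h in this example.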

From (6.2.7) and (6.2.8) it follows that any distribution may be
decomposed into a discontinuous and a continuous part by writing

(6.6.2)
    $P(S) = c_1 P_1(S) + c_2 P_2(S),$
    $F(x) = c_1 F_1(x) + c_2 F_2(x).$

Here $c_1$ and $c_2$ are non-negative constants such that $c_1 + c_2 = 1$. $P_1$ and
$F_1$ denote the probability function and distribution function of a
distribution, the total mass of which is concentrated in discrete mass
points (thus $F_1$ is a step-function). $P_2$ and $F_2$, on the other hand,
correspond to a distribution without any discrete mass points (thus
$F_2$ is everywhere continuous). The constants $c_1$ and $c_2$, as well as the
functions $P_1$, $P_2$, $F_1$ and $F_2$ are uniquely determined by the given
distribution.

In the extreme case when $c_1 = 1$, $c_2 = 0$, the distribution function
F(x) is a step-function, and the whole mass of the distribution is
concentrated in the discontinuity points of F(x), each of which carries
a mass quantity equal to the corresponding saltus. The opposite
extreme is characterized by $c_1 = 0$, $c_2 = 1$, when F(x) is everywhere
continuous, and there is no single point carrying a positive quantity of mass.

In Ch. 15 we shall give a detailed treatment of the general theory
of distributions in $R_1$. In the subsequent Chs. 16-19, certain
important special distributions will be discussed and illustrated by figures.
At the present stage, the reader may find it instructive to consult
Figs 4-5 (p. 169), which correspond to the case $c_1 = 1$, $c_2 = 0$, and
Figs 6-7 (p. 170-171), which correspond to the case $c_1 = 0$, $c_2 = 1$.

6.7. Sequences of distributions. An interval $(a, b)$ will be called
a continuity interval for a given non-negative and additive set function
$P(S)$, and for the corresponding point function $F(x)$, when both
extremes 1) $a$ and $b$ are continuity points (cf. 6.2) of $P(S)$ and $F(x)$. If
two set functions agree for all intervals that are continuity intervals
for both, it is easily seen that the corresponding point functions $F(x)$
differ by a constant, so that the set functions are identical.

Consider now a sequence of distributions, with the probability functions
$P_1(S), P_2(S), \ldots$ and the distribution functions $F_1(x), F_2(x), \ldots$.
We shall say that the sequence is convergent, if there is a non-negative
and additive set function $P(S)$ such that $P_n(S) \to P(S)$ whenever $S$
is a continuity interval for $P(S)$.

1) Note that any inner point of the interval may be a discontinuity. The name
of continuity-bordered interval, though longer, would perhaps be more adequate.

Since we always have $0 \le P_n(S) \le 1$, it follows that for a convergent
sequence we have $0 \le P(S) \le 1$ for any continuity interval $S = (a, b)$.
When $a \to -\infty$ and $b \to +\infty$, it then follows from (6.2.3)
that $P(R_1) \le 1$. The case when $P(R_1) = 1$ is of special interest.
In this case $P(S)$ is the probability function of a certain distribution,
and we shall accordingly say that our sequence converges to a distribution,
viz. to the distribution corresponding to $P(S)$. Usually it is
only this mode of convergence that is interesting in the applications,
and we shall often want a criterion that will enable us to decide
whether a given sequence of distributions converges to a distribution
or not. The important problem of finding such a criterion will be
solved later (cf. 10.4); for the present we shall only give the following
preliminary proposition:

A sequence of distributions with the distribution functions $F_1(x)$,
$F_2(x), \ldots$ converges to a distribution when and only when there is a
distribution function $F(x)$ such that $F_n(x) \to F(x)$ in every continuity point
of $F(x)$. When such a function $F(x)$ exists, $F(x)$ is the distribution
function corresponding to the limiting distribution of the sequence, and
we shall briefly say that the sequence $\{F_n(x)\}$ converges to the
distribution function $F(x)$.

We shall first show that the condition is necessary, and that the
limit $F(x)$ is the distribution function of the limiting distribution.
Denoting as usual by $P_n(S)$ the probability function corresponding to
$F_n(x)$, we thus assume that $P_n(S)$ tends to a probability function $P(S)$
whenever $S$ is a continuity interval $(a, b)$ for $P(S)$. Denoting by $F(x)$
the distribution function corresponding to $P(S)$, we have to show that
$F_n(x) \to F(x)$, where $x$ is an arbitrary continuity point of $F(x)$. Since
$P(R_1) = 1$, we can choose a continuity interval $S = (a, b)$ including $x$ such
that $P(S) > 1 - \varepsilon$, where $\varepsilon > 0$ is arbitrarily small. Then
$1 - \varepsilon < P(S) = F(b) - F(a) \le 1 - F(a)$, so that $0 \le F(a) < \varepsilon$.
Further, we have by hypothesis $F_n(b) - F_n(a) \to F(b) - F(a) > 1 - \varepsilon$,
so that for all sufficiently large $n$ we have $F_n(b) - F_n(a) > 1 - 2\varepsilon$,
or $0 \le F_n(a) < F_n(b) - 1 + 2\varepsilon \le 2\varepsilon$.
Since $(a, x)$ is a continuity interval for $P(S)$, we have by hypothesis
$F_n(x) - F_n(a) \to F(x) - F(a)$. For all sufficiently large $n$ we thus have
$|F_n(x) - F(x) - F_n(a) + F(a)| < \varepsilon$, and hence according to the above
$|F_n(x) - F(x)| < 3\varepsilon$. Since $\varepsilon$ is arbitrary, it follows that
$F_n(x) \to F(x)$.

Conversely, if we assume that $F_n(x)$ tends to a distribution function
$F(x)$ in every continuity point of $F(x)$, and if we denote by $P(S)$ the
probability function corresponding to $F(x)$, it immediately follows that
$F_n(b) - F_n(a) \to F(b) - F(a)$, i.e. that $P_n(S) \to P(S)$, whenever $S$ is a
half-open continuity interval $a < x \le b$ for $P(S)$. Further, since $F(x)$
is never decreasing and continuous for $x = a$ and $x = b$, it follows
that $F_n(a - 0) \to F(a)$ and $F_n(b - 0) \to F(b)$. Hence we obtain the same
relation $P_n(S) \to P(S)$ whether the continuity interval $S = (a, b)$ is
regarded as closed, open or half-open. Thus the proposition is proved.


In order to show by an example that a sequence of distributions may converge
without converging to a distribution, we consider first the distribution which has the
whole mass unit placed in the single point $x = 0$. Denoting the corresponding
distribution function by $\varepsilon(x)$, we have

(6.7.1)  $\varepsilon(x) = \begin{cases} 0 & \text{for } x < 0, \\ 1 & \text{for } x \ge 0. \end{cases}$

Then $\varepsilon(x - a)$ is the distribution function of a distribution which has the whole mass
unit placed in the point $x = a$. Consider now the sequence of distributions defined
by the distribution functions $F_n(x) = \varepsilon(x - n)$, where $n = 1, 2, \ldots$. Obviously this
sequence is convergent according to the above definition, since the mass contained in any finite
interval tends to zero as $n \to \infty$. The limiting set function is, however, identically
equal to zero, and is thus not a probability function. When $n \to \infty$, the mass in our
distributions disappears, as it were, towards $+\infty$.
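
A minimal Python sketch of this example follows. The function names and the interval $(-5, 5]$ are hypothetical choices made for illustration. It evaluates the mass that $F_n(x) = \varepsilon(x - n)$ assigns to a fixed finite interval and shows that this mass is zero as soon as $n$ exceeds the right end point.

```python
def eps(x):
    """Distribution function (6.7.1) of a unit mass placed at x = 0."""
    return 1.0 if x >= 0 else 0.0

def F_n(x, n):
    """Distribution function of a unit mass placed at x = n."""
    return eps(x - n)

def mass_in_interval(n, a, b):
    """Mass that the n-th distribution assigns to the interval a < x <= b."""
    return F_n(b, n) - F_n(a, n)

a, b = -5.0, 5.0
masses = [mass_in_interval(n, a, b) for n in range(1, 11)]

# For every fixed x the values F_n(x) vanish for all n > x,
# so the limiting point function is identically zero.
limit_at_3 = [F_n(3.0, n) for n in (4, 10, 100)]
```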

It might perhaps be asked why, in our convergence definition, we should not
require that $P_n(S) \to P(S)$ for every Borel set $S$. It is, however, easily shown that this
would be a too restrictive definition. Consider, in fact, the sequence of distributions
defined by the distribution functions $\varepsilon(x - 1/n)$, where $n = 1, 2, \ldots$. The $n$:th
distribution in this sequence has its whole mass unit placed in the point $x = 1/n$. It is
evident that any reasonable convergence definition must be such that this sequence
converges to the distribution defined by (6.7.1), where the whole mass unit is placed
in $x = 0$. It is easily verified that the convergence definition given above satisfies
this condition. If, on the other hand, we consider the set $S$ containing the single
point $x = 0$, our sequence gives $P_n(S) = 0$ for every $n$, while for the limiting
distribution we have $P(S) = 1$, so that $P_n(S)$ does certainly not tend to $P(S)$.
Accordingly the distribution function $\varepsilon(x - 1/n)$ tends to $\varepsilon(x)$ in every continuity point of
$\varepsilon(x)$, i.e. for any $x \ne 0$, but not in the discontinuity point $x = 0$.
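
The behaviour just described can be checked directly. The following Python sketch uses hypothetical function names and sample points chosen only for illustration. It evaluates $\varepsilon(x - 1/n)$ at a point on each side of the origin and at the origin itself.

```python
def eps(x):
    """Distribution function (6.7.1) of a unit mass placed at x = 0."""
    return 1.0 if x >= 0 else 0.0

def F_n(x, n):
    """Distribution function of a unit mass placed at x = 1/n."""
    return eps(x - 1.0 / n)

ns = (1, 10, 1000, 10**6)

right_of_zero = [F_n(0.01, n) for n in ns]    # continuity point x = 0.01
left_of_zero = [F_n(-0.01, n) for n in ns]    # continuity point x = -0.01
at_zero = [F_n(0.0, n) for n in ns]           # discontinuity point x = 0
```

At the continuity points the values settle on $\varepsilon(0.01) = 1$ and $\varepsilon(-0.01) = 0$, while at $x = 0$ every $F_n$ equals 0 although $\varepsilon(0) = 1$.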


6.8. A convergence theorem. A sequence of distribution functions
$F_1(x), F_2(x), \ldots$ is said to be convergent, if there is a non-decreasing
function $F(x)$ such that $F_n(x) \to F(x)$ in every continuity
point of $F(x)$. We then always have $0 \le F(x) \le 1$, but the example
$F_n(x) = \varepsilon(x - n)$ considered in the preceding paragraph shows that $F(x)$
is not necessarily a distribution function. Thus a sequence $\{F_n(x)\}$
may be convergent without converging to a distribution function. We
shall now prove the following proposition that will be required in the
sequel: Every sequence $\{F_n(x)\}$ of distribution functions contains a
convergent sub-sequence. The limit can always be determined so as to
be everywhere continuous to the right.


Let $r_1, r_2, \ldots$ be the enumerable (cf. 2.2) set of all positive and
negative rational numbers, including zero, and consider the sequence
$F_1(r_1), F_2(r_1), \ldots$. This is a bounded infinite sequence of real numbers,
which by the Bolzano-Weierstrass theorem (2.2) has at least one limiting
point. The sequence of numbers $\{F_n(r_1)\}$ thus always contains a
convergent sub-sequence. The same thing may also be expressed by saying
that the sequence of functions $\{F_n(x)\}$ always contains a sub-sequence
$Z_1$ convergent for the particular value $x = r_1$. By the same argument,
we find that $Z_1$ contains a sub-sequence $Z_2$ convergent for $x = r_1$ and
for $x = r_2$. Repeating the same procedure, we obtain successively the
sub-sequences $Z_1, Z_2, \ldots$, where $Z_n$ is a sub-sequence of $Z_{n-1}$, and $Z_n$
converges for the particular values $x = r_1, \ldots, r_n$. Forming finally
the "diagonal" sequence $Z$ consisting of the first member of $Z_1$, the
second member of $Z_2$, ..., it is readily seen that $Z$ converges for every
rational value of $x$.

Let the members of $Z$ be $F_{n_1}(x), F_{n_2}(x), \ldots$, and put

$\lim_{\nu \to \infty} F_{n_\nu}(r_i) = c_i \qquad (i = 1, 2, \ldots).$

Then $\{c_i\}$ is a bounded sequence, and since every $F_{n_\nu}$ is a non-decreasing
function, it follows that we have $c_i \le c_k$ as soon as $r_i \le r_k$.
Now we define a function $F(x)$ by writing

$F(x)$ = lower bound of $c_i$ for all $r_i > x$.

It then follows directly from the definition that $F(x)$ is a bounded
non-decreasing function of $x$. It is also easily proved that $F(x)$ is
everywhere continuous to the right. We shall now show that in every
continuity point of $F(x)$ we have

(6.8.1)  $\lim_{\nu \to \infty} F_{n_\nu}(x) = F(x),$

so that the sub-sequence $Z$ is convergent.

If $x$ is a continuity point of $F(x)$ we can, in fact, choose $h > 0$
such that the difference $F(x + h) - F(x - h)$ is smaller than any given
$\varepsilon > 0$. Let $r_i$ and $r_k$ be rational points situated in the open intervals
$(x - h, x)$ and $(x, x + h)$ respectively, so that

(6.8.2)  $F(x - h) \le c_i \le F(x) \le c_k \le F(x + h).$

Further, for every $\nu$ we have

(6.8.3)  $F_{n_\nu}(r_i) \le F_{n_\nu}(x) \le F_{n_\nu}(r_k).$


As $\nu$ tends to infinity, $F_{n_\nu}(r_i)$ and $F_{n_\nu}(r_k)$ tend to the limits $c_i$ and $c_k$
respectively. The difference between these limits is, according to (6.8.2),
smaller than $\varepsilon$, and the quantity $F(x)$ is included between $c_i$ and $c_k$.
Since $\varepsilon$ is arbitrary, it follows that $F_{n_\nu}(x)$ tends to $F(x)$. Thus the
sub-sequence $Z$ is convergent, and our theorem is proved.


CHAPTER 7. 


The Lebesgue-Stieltjes Integral for Functions of 
One Variable. 

7.1. The integral of a bounded function over a set of finite P-measure.
In the preceding chapter, we have seen that the theory
of Lebesgue measure given in Ch. 4 may be generalized by the introduction
of the concept of a general non-negative and additive P-measure.
We now proceed to show that an exactly analogous generalization
may be applied to the theory of the Lebesgue integral
developed in Ch. 5.

Let us assume that a fixed P-measure is given. This measure may
be defined by a non-negative and additive set function $P(S)$, or by
the corresponding non-decreasing point function $F(x)$. We have seen
in the preceding chapter that these two functions are perfectly equivalent
for the purpose of defining the P-measure.

Let further $g(x)$ be a given function of $x$, defined and bounded
for all $x$ belonging to a given set $S$ of finite P-measure. In the same
way as in 5.1, we divide $S$ into an arbitrary finite number of parts
$S_1, S_2, \ldots, S_n$, no two of which have a common point. In the basic
definition (5.1.1) of the Darboux sums, we now replace L-measure by
P-measure, and so obtain the generalized Darboux sums

(7.1.1)  $z = \sum_{\nu=1}^{n} m_\nu P(S_\nu), \qquad Z = \sum_{\nu=1}^{n} M_\nu P(S_\nu),$

where, as in the previous case, $m_\nu$ and $M_\nu$ denote the lower and upper
bounds of $g(x)$ in $S_\nu$.

The further development is exactly analogous to 5.1. The upper
bound of the set of all possible $z$-values is called the lower integral of
$g(x)$ over $S$ with respect to the given P-measure, while the lower
bound of the set of all possible $Z$-values is the corresponding upper
integral. As in 5.1 it is shown that the lower integral is at most equal
to the upper integral.

If the lower and upper integrals are equal, $g(x)$ is said to be integrable
over $S$ with respect to the given P-measure, and the common value
of the two integrals is called the Lebesgue-Stieltjes integral of $g(x)$ over
$S$ with respect to the given P-measure, and is denoted by any of the
two expressions

$\int_S g(x)\, dP(S) = \int_S g(x)\, dF(x).$

When there is no risk of a misunderstanding, we shall write simply
$dP$ and $dF$ instead of $dP(S)$ and $dF(x)$. Instead of integral or integrable
with respect to the given P-measure, we shall usually say with
respect to $P(S)$, or with respect to $F(x)$, according as we consider the
P-measure to be defined by $P(S)$ or by $F(x)$. As long as we are
dealing with functions of a single variable, we shall as a rule prefer
to use $F(x)$.

In the particular case when $F(x) = x$, we have $P(S) = L(S)$, and
it is evident that the above definition of the Lebesgue-Stieltjes integral
reduces to the definition of the Lebesgue integral given in 5.1. Thus
the Lebesgue-Stieltjes integral is obtained from the Lebesgue integral
simply by replacing, in the definition of the integral, the Lebesgue
measure by the more general P-measure.

All properties of the Lebesgue integral deduced in 5.2 and 5.3 are
now easily generalized to the Lebesgue-Stieltjes integral, no other
modification of the proofs being required than the substitution of
P-measure for L-measure. Thus we find that, if $g(x)$ is bounded and
B-measurable in a set $S$ of finite P-measure, then $g(x)$ is integrable
over $S$ with respect to $P(S)$. For bounded functions and sets of finite
P-measure, we further obtain the following generalizations of relations
deduced in 5.3:

(7.1.2)  $\int_S (g_1(x) + g_2(x))\, dF = \int_S g_1(x)\, dF + \int_S g_2(x)\, dF,$

(7.1.3)  $\int_S c\, g(x)\, dF = c \int_S g(x)\, dF,$

(7.1.4)  $m P(S) \le \int_S g(x)\, dF \le M P(S),$

(7.1.5)  $\int_{S_1 + S_2} g(x)\, dF = \int_{S_1} g(x)\, dF + \int_{S_2} g(x)\, dF,$


(7.1.6)  $\left| \int_S g(x)\, dF \right| \le \int_S |g(x)|\, dF,$

where $c$ is a constant, $m$ and $M$ denote the lower and upper bounds
of $g(x)$ in $S$, while $S_1$ and $S_2$ are two sets without common points.
It follows from (7.1.4) that the integral of a bounded function over a
set of P-measure zero is equal to zero. Thus the value of an integral
is not affected if the values of the function $g(x)$ are arbitrarily changed
over a set of P-measure zero.

We also have the following proposition generalizing (5.3.6): If the
sequence $\{g_\nu(x)\}$ is uniformly bounded in $S$, and if $\lim_{\nu \to \infty} g_\nu(x) = g(x)$
exists almost everywhere (P) in $S$, then

(7.1.7)  $\lim_{\nu \to \infty} \int_S g_\nu(x)\, dF = \int_S g(x)\, dF.$

The analogous generalizations of (5.3.7) and (5.3.8) are obtained in the
same way as in 5.3.

If $c_1$ and $c_2$ are non-negative constants, we easily deduce the following
relation, which has no analogue for the Lebesgue integral:

$\int_S g(x)\, d(c_1 F_1 + c_2 F_2) = c_1 \int_S g(x)\, dF_1 + c_2 \int_S g(x)\, dF_2.$


In the particular case when the set $S$ consists of a single point
$x_0$, we obtain directly from the definition

$\int_{(x = x_0)} g(x)\, dF = g(x_0)\, P(x = x_0).$

Consider now the case when $F(x)$ is a step-function (cf. 6.2) with
steps of the height $p_\nu$ in the points $x = x_\nu$, and denote the set of all
points $x_\nu$ by $X$. Using the fact that the integral over a set of
P-measure zero is equal to zero, and the generalization of (5.3.8)
mentioned above, we then obtain

(7.1.8)  $\int_S g(x)\, dF = \int_{SX} g(x)\, dF = \sum_{x_\nu \in S} \int_{(x = x_\nu)} g(x)\, dF = \sum_{x_\nu \in S} p_\nu\, g(x_\nu).$

In the further particular case when $g(x) = 1$, we have

$\int_S dF = \int_S dP = P(S).$
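
The relation (7.1.8) can be compared numerically with the Darboux sums (7.1.1). The Python sketch below is an illustration only: the step-function, the integrand $g(x) = x^2$ and the division of $S = (0, 4]$ into equal half-open sub-intervals are hypothetical choices, and the simple formulas for the bounds $m_\nu$, $M_\nu$ assume that $g$ is increasing on $S$.

```python
def step_integral(g, points, steps, a, b):
    """Integral of g over a < x <= b with respect to a step-function, by (7.1.8)."""
    return sum(p * g(x) for x, p in zip(points, steps) if a < x <= b)

def darboux_sums(g, F, a, b, n):
    """Lower and upper Darboux sums (7.1.1) for an increasing g on a < x <= b,
    using n half-open sub-intervals of equal length."""
    z = Z = 0.0
    for v in range(1, n + 1):
        x_prev = a + (b - a) * (v - 1) / n
        x_v = a + (b - a) * v / n
        mass = F(x_v) - F(x_prev)      # P-measure of x_prev < x <= x_v
        z += g(x_prev) * mass          # lower bound of an increasing g
        Z += g(x_v) * mass             # upper bound of an increasing g
    return z, Z

# Hypothetical step-function with steps 0.2, 0.5, 0.3 in the points 1, 2, 3.
points, steps = [1.0, 2.0, 3.0], [0.2, 0.5, 0.3]
F = lambda x: sum(p for xp, p in zip(points, steps) if xp <= x)
g = lambda x: x * x

exact = step_integral(g, points, steps, 0.0, 4.0)      # 0.2*1 + 0.5*4 + 0.3*9
z, Z = darboux_sums(g, F, 0.0, 4.0, 4000)
total_mass = step_integral(lambda x: 1.0, points, steps, 0.0, 4.0)
```

The sums `z` and `Z` enclose `exact` and approach it as the division is refined, while `total_mass` reproduces the case $g(x) = 1$.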



We shall often have to consider integrals, where the function $g(x)$
is complex-valued, say $g(x) = a(x) + i\, b(x)$, where $a(x)$ and $b(x)$ are real
and bounded in $S$. We then define the integral by writing

$\int_S g(x)\, dF = \int_S a(x)\, dF + i \int_S b(x)\, dF.$

All properties deduced above extend themselves easily to integrals of
this type. For the relation (7.1.6), this extension is a little less obvious
than in the other cases, and will be shown here. Put

$\int_S g(x)\, dF = r e^{iv},$

where $r$ and $v$ are real, and $r \ge 0$. The real part of the quantity
$|g(x)| - e^{-iv} g(x)$ is always $\ge 0$. Consequently the real integral

$\int_S \left( |g(x)| - e^{-iv} g(x) \right) dF = \int_S |g(x)|\, dF - \left| \int_S g(x)\, dF \right|$

is $\ge 0$, and this is equivalent to (7.1.6).
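
The argument can be followed numerically for a step-function $F$, where all integrals reduce to finite sums by (7.1.8). In the following Python sketch the mass points, the steps and the integrand $g(x) = e^{itx}$ are hypothetical choices made for illustration.

```python
import cmath

# Hypothetical step-function F: steps 0.2, 0.5, 0.3 in the points -1, 0.5, 2.
points = [-1.0, 0.5, 2.0]
steps = [0.2, 0.5, 0.3]

def integral(g):
    """Integral of g with respect to the step-function F, by (7.1.8)."""
    return sum(p * g(x) for x, p in zip(points, steps))

t = 1.7
g = lambda x: cmath.exp(1j * t * x)        # complex-valued and bounded

lhs = abs(integral(g))                     # | integral of g dF |
rhs = integral(lambda x: abs(g(x)))        # integral of |g| dF

# Write the integral as r * e^{iv} and integrate the real part of |g| - e^{-iv} g,
# which is non-negative for every x.
r, v = cmath.polar(integral(g))
gap = integral(lambda x: abs(g(x)) - (cmath.exp(-1j * v) * g(x)).real)
```

Here `gap` equals `rhs - lhs` up to rounding and is non-negative, which is the content of (7.1.6).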


7.2. Unbounded functions and sets of infinite P-measure. The
extensions of the Lebesgue integral treated in 5.4 and 5.5 may be applied
in a perfectly analogous way to the Lebesgue-Stieltjes integral.
In fact, every word and every formula of 5.4 and 5.5 hold good, if
Lebesgue measure is throughout replaced by P-measure, and Lebesgue
integrals are replaced by Lebesgue-Stieltjes integrals with respect to
$P(S)$ or $F(x)$.

Thus $g(x)$ is called integrable with respect to $P(S)$ (or $F(x)$)
over a set $S$ of finite P-measure, if the limit (cf. 5.4.1)

$\lim_{a \to -\infty,\ b \to +\infty} \int_S g_{a,b}(x)\, dP = \int_S g(x)\, dP = \int_S g(x)\, dF$

exists and has a finite value. If this is the case, $|g(x)|$ is also integrable
with respect to $P$ over $S$.

Further, when $S$ is of infinite P-measure 1), $g(x)$ is called integrable
with respect to $P$ (or $F$) over $S$, if (cf. 5.5) $g(x)$ is integrable
with respect to $P$ (or $F$) over $S_{a,b}$ for all $a$ and $b$, and if the
limit

$\lim_{a \to -\infty,\ b \to +\infty} \int_{S_{a,b}} |g(x)|\, dP = \int_S |g(x)|\, dP = \int_S |g(x)|\, dF$

exists and has a finite value.

1) In the case of a bounded $P(S)$ (e.g. when $P(S)$ is a probability function, cf.
6.6) there are, of course, no sets of infinite P-measure.

If this is the case, the limit (cf. 5.5.1)

(7.2.1)  $\lim_{a \to -\infty,\ b \to +\infty} \int_{S_{a,b}} g(x)\, dP = \int_S g(x)\, dP = \int_S g(x)\, dF$

also exists and is finite, and we shall accordingly say that the
Lebesgue-Stieltjes integral of $g(x)$ with respect to $P$ (or $F$) over
the set $S$ is convergent 1). The limit (7.2.1) is then, by definition, the
value of this integral. If $|g(x)| < G(x)$, where $G(x)$ is integrable,
then $g(x)$ is itself integrable.

The properties (7.1.2)-(7.1.6) of the Lebesgue-Stieltjes integral hold
true for any functions integrable with respect to the given P-measure.
In the case of a set $S$ of infinite P-measure the relation (7.1.4) should,
however, be replaced by

$\int_S g(x)\, dF \ge 0$ if $g(x) \ge 0$ for all $x$ in $S$.

We finally have the following generalization of the proposition
expressed by (7.1.7): If $\lim_{\nu \to \infty} g_\nu(x) = g(x)$ exists almost everywhere (P) in
the set $S$ of finite or infinite P-measure, and if $|g_\nu(x)| < G(x)$ for all $\nu$
and for all $x$ in $S$, where $G(x)$ is integrable with respect to $F$ over $S$,
then $g(x)$ is integrable with respect to $F$ over $S$, and

(7.2.2)  $\lim_{\nu \to \infty} \int_S g_\nu(x)\, dF = \int_S g(x)\, dF.$

The generalization of the above considerations to the case of integrals
with a complex-valued function $g(x)$ is obvious.

In the particular case when $F(x) = x$ all our theorems reduce, of
course, to the corresponding theorems on ordinary Lebesgue integrals.

1) With respect to the terminology, the same remark should be made here as in
the case of (5.5.1).

7.3. Lebesgue-Stieltjes integrals with a parameter. We shall
often be concerned with integrals of the type

$u(t) = \int_S g(x, t)\, dF(x),$

where $t$ is a parameter, while $S$ is a given set of finite or infinite
P-measure. We shall require certain theorems concerning continuity,
differentiation and integration of such integrals with respect to $t$. In
the particular case when $F(x) = x$, these theorems reduce to theorems
on Lebesgue integrals.

We assume that $g(x, t)$ is complex-valued and that, for every fixed
$t$ that will be considered, the real and imaginary parts are B-measurable
functions of $x$ which are integrable over $S$ with respect to $F(x)$.
By $G_1(x), G_2(x), \ldots$, we denote functions which are integrable over $S$
with respect to $F(x)$.

I) Continuity. If, for almost all (P) values of $x$ in $S$, the function
$g(x, t)$ is continuous with respect to $t$ in the point $t = t_0$, and if,
for all $t$ in some neighbourhood of $t_0$, we have $|g(x, t)| < G_1(x)$, then $u(t)$
is continuous for $t = t_0$, so that we have 1)

(7.3.1)  $\lim_{t \to t_0} \int_S g(x, t)\, dF(x) = \int_S g(x, t_0)\, dF(x).$


This is a direct corollary of (7.2.2). For any sequence of values
$t_1, t_2, \ldots$ belonging to the given neighbourhood and tending to $t_0$,
the conditions of (7.2.2) are, in fact, satisfied if we take $g_\nu(x) = g(x, t_\nu)$
and $g(x) = g(x, t_0)$. Thus by (7.2.2) we have $u(t_\nu) \to u(t_0)$, and it follows
that the same relation holds when $t$ tends continuously to $t_0$.
When the conditions of I) are satisfied for all $t_0$ in the open interval
$(a, b)$, it is seen that $u(t)$ is continuous in the whole interval.

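II) Differentiation. If, for almost all (P) values of $x$ in $S$ and for
a fixed value of $t$, the following conditions are satisfied:

1) The partial derivative $\frac{\partial g(x, t)}{\partial t}$ exists,

2) We have $\left| \frac{g(x, t + h) - g(x, t)}{h} \right| < G_2(x)$ for $0 < |h| < h_0$, where $h_0$
is independent of $x$,

then

(7.3.2)  $u'(t) = \frac{d}{dt} \int_S g(x, t)\, dF(x) = \int_S \frac{\partial g(x, t)}{\partial t}\, dF(x).$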


Like the preceding proposition, this is a direct corollary of (7.2.2).
For any sequence $h_1, h_2, \ldots$, where $0 < |h_\nu| < h_0$ and $h_\nu$ tends to zero, the
conditions of (7.2.2) are satisfied if we take

$g_\nu(x) = \frac{g(x, t + h_\nu) - g(x, t)}{h_\nu}$ and $g(x) = \frac{\partial g(x, t)}{\partial t}.$

Thus

$\frac{u(t + h_\nu) - u(t)}{h_\nu} = \int_S \frac{g(x, t + h_\nu) - g(x, t)}{h_\nu}\, dF(x) \to \int_S \frac{\partial g(x, t)}{\partial t}\, dF(x),$

so that the derivative $u'(t)$ exists and has the value given by (7.3.2).

1) Theorem I) holds, with the same proof, even if $t_0$ is replaced by $+\infty$ or $-\infty$.


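We remark that, if the partial derivative $\frac{\partial g(x, t)}{\partial t}$ exists and satisfies the
condition $\left| \frac{\partial g(x, t)}{\partial t} \right| < G_3(x)$ for all $t$ in the open interval $(a, b)$, it follows
from the relation

$g(x, t + h) - g(x, t) = h \left[ \frac{\partial g}{\partial t} \right]_{t + \theta h} \qquad (0 < \theta < 1),$

that (7.3.2) holds for all $t$ in $(a, b)$.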

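Note that the condition 2) of II) is not satisfied e.g. if we take $F(x) = x$,
$S = (-\infty, +\infty)$, and

$g(x, t) = \begin{cases} e^{t - x} & \text{for } x \ge t, \\ 0 & \text{for } x < t. \end{cases}$

In this case we have

$u(t) = \int_{-\infty}^{\infty} g(x, t)\, dx = \int_t^{\infty} e^{t - x}\, dx = 1,$

and the application of (7.3.2) would give

$u'(t) = \int_{-\infty}^{\infty} \frac{\partial g}{\partial t}\, dx = \int_t^{\infty} e^{t - x}\, dx = 1,$

which is obviously false. The correct way of calculating $u'(t)$ is here, of course, to
take account of the variable lower limit of the integral, thus obtaining

$u'(t) = \int_t^{\infty} e^{t - x}\, dx - 1 = 0.$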
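
A rough numerical check of this example is sketched below in Python. The midpoint rule, the truncation of the range of integration at $t + 50$ and the step sizes are assumptions made for the illustration. The sketch confirms that $u(t)$ is constant, so that the difference quotient vanishes, whereas integrating $\partial g / \partial t$ (which equals $g$ for $x > t$) gives 1.

```python
import math

def g(x, t):
    """g(x, t) = e^{t - x} for x >= t, and 0 for x < t."""
    return math.exp(t - x) if x >= t else 0.0

def u(t, upper=50.0, n=100_000):
    """Midpoint-rule approximation of the integral of g(x, t) over (t, t + upper);
    the neglected tail beyond t + upper is of the order e^{-upper}."""
    h = upper / n
    return sum(g(t + (k + 0.5) * h, t) for k in range(n)) * h

t, dt = 0.3, 1e-3
u_t = u(t)

# Symmetric difference quotient of u: the true derivative is 0.
difference_quotient = (u(t + dt) - u(t - dt)) / (2 * dt)

# For x > t the partial derivative dg/dt coincides with g, so a formal
# application of (7.3.2) would return the integral of g itself.
formal_result = u_t
```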


III) Integration. If, for almost all (P) values of $x$ in $S$, the function
$g(x, t)$ is continuous with respect to $t$ in the finite open interval $(a, b)$
and satisfies the condition $|g(x, t)| < G_4(x)$ for all $t$ in $(a, b)$, then

(7.3.3)  $\int_a^b u(t)\, dt = \int_a^b \left[ \int_S g(x, t)\, dF(x) \right] dt = \int_S \left[ \int_a^b g(x, t)\, dt \right] dF(x).$


Further, if the above conditions are satisfied for every finite interval
$(a, b)$ and in addition, we have $\int_{-\infty}^{\infty} |g(x, t)|\, dt < G_5(x)$, then 1)

(7.3.4)  $\int_{-\infty}^{\infty} u(t)\, dt = \int_S \left[ \int_{-\infty}^{\infty} g(x, t)\, dt \right] dF(x).$

We consider first the case of a finite interval $(a, b)$. For almost
all (P) values of $x$ in $S$, the integral

$h(x, t) = \int_a^t g(x, \tau)\, d\tau$

has, by (5.1.4), for all $t$ in $(a, b)$ the partial derivative $\frac{\partial h}{\partial t} = g(x, t)$,
so that we have $\left| \frac{\partial h}{\partial t} \right| < G_4(x)$. Further $|h(x, t)| < (b - a)\, G_4(x)$,
so that $h(x, t)$ is integrable over $S$ with respect to $F(x)$. Writing

$v(t) = \int_S h(x, t)\, dF(x),$

we may now apply the remark to theorem II), and find

$v'(t) = \int_S g(x, t)\, dF(x) = u(t).$

By I), the function $u(t)$ is continuous in $(a, b)$, so that the difference

$\Delta(t) = \int_a^t u(\tau)\, d\tau - v(t)$

has a derivative $\Delta'(t) = u(t) - v'(t) = 0$. For $t = a$, we have $h(x, a) = 0$,
$v(a) = 0$, and thus $\Delta(a) = 0$. It follows that $\Delta(t) = 0$ for $a \le t \le b$,
and thus in particular $\Delta(b) = 0$, which is identical with (7.3.3).

When the conditions of the second part of the theorem are satisfied,
(7.3.3) holds for any finite interval $(a, b)$, and we have

$\int_a^b |u(t)|\, dt \le \int_a^b \left[ \int_S |g(x, t)|\, dF(x) \right] dt = \int_S \left[ \int_a^b |g(x, t)|\, dt \right] dF(x) \le \int_S G_5(x)\, dF(x).$

Thus the integral $\int_{-\infty}^{\infty} |u(t)|\, dt$ is convergent. If, in the relation (7.3.3),
we allow $a$ and $b$ to tend to $-\infty$ and $+\infty$ respectively, it follows
that the first member tends to the first member of (7.3.4). An application
of (7.2.2) shows that, at the same time, the second member
of (7.3.3) tends to the second member of (7.3.4). Thus (7.3.4) is proved.

1) It is evident how the conditions should be modified when we want to integrate
$u(t)$ over $(a, \infty)$ or $(-\infty, b)$.

The theorems proved in this paragraph show that, subject to
certain conditions, analytical operations such as limit passages,
differentiations and integrations with respect to a parameter may be
performed under a sign of integration.


7.4. Lebesgue-Stieltjes integrals with respect to a distribution.
If $P(S)$ is the probability function of a distribution (cf. 6.6), the
integral

(7.4.1)  $\int_{R_1} g(x)\, dP = \int_{-\infty}^{\infty} g(x)\, dP = \int_{-\infty}^{\infty} g(x)\, dF$

may be concretely, though somewhat vaguely, interpreted as a weighted
mean of the values of $g(x)$ for all values of $x$, the weights being
furnished by the mass quantities $dP$ or $dF$ situated in the neighbourhood
of each point $x$. The sum of all weights is unity, since we
have

$\int_{-\infty}^{\infty} dP = \int_{-\infty}^{\infty} dF = P(R_1) = 1.$

Every bounded and B-measurable $g(x)$ is integrable with respect to $P$
(or $F$) over $(-\infty, \infty)$.

If the mass distribution is represented as the sum of two components
according to (6.6.2), the integral (7.4.1) becomes

$\int_{-\infty}^{\infty} g(x)\, dF = c_1 \int_{-\infty}^{\infty} g(x)\, dF_1 + c_2 \int_{-\infty}^{\infty} g(x)\, dF_2,$

where the first term of the second member reduces to a sum over the
discrete mass points of the distribution, as in (7.1.8).

If, for a positive integer $\nu$, the function $x^\nu$ is integrable with
respect to $F(x)$ over $(-\infty, \infty)$, the integral

$\alpha_\nu = \int_{-\infty}^{\infty} x^\nu\, dF(x)$

is called the moment of order $\nu$, or simply the $\nu$:th moment, of the
distribution, and we say that the $\nu$:th moment exists. It is then easily
seen that any moment of order $\nu' < \nu$ also exists.

It is known from elementary mechanics that the first order moment
$\alpha_1$ is the abscissa of the centre of gravity of the mass in the distribution,
while the second order moment $\alpha_2$ represents the moment of
inertia of the mass with respect to a perpendicular axis through
the point $x = 0$. The moments of a distribution will play an
important part in the applications made later in this book.
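
As an illustration of the definition, the Python sketch below computes the first moments of a mixed distribution $F = c_1 F_1 + c_2 F_2$. The mass points, the weights and the choice of the uniform distribution on (0, 1) for $F_2$ are hypothetical. For the discrete component the integral is the sum $\sum p_\nu x_\nu^\nu$ of (7.1.8), and for the uniform component $\int_0^1 x^\nu\, dx = 1/(\nu + 1)$.

```python
def moment(nu, c1, points, masses, c2):
    """nu-th moment of F = c1*F_1 + c2*F_2, where F_1 is a step-function with
    the given masses and F_2 is the uniform distribution on (0, 1)."""
    discrete = sum(p * x ** nu for x, p in zip(points, masses))
    continuous = 1.0 / (nu + 1)        # integral of x^nu over (0, 1)
    return c1 * discrete + c2 * continuous

# Hypothetical example: masses 0.5 and 0.5 in the points -1 and 2.
c1, c2 = 0.4, 0.6
points, masses = [-1.0, 2.0], [0.5, 0.5]

alpha_0 = moment(0, c1, points, masses, c2)   # total mass
alpha_1 = moment(1, c1, points, masses, c2)   # abscissa of the centre of gravity
alpha_2 = moment(2, c1, points, masses, c2)   # moment of inertia about x = 0
```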

If, for some $k > 0$, the distribution function $F(x)$ satisfies the conditions (with
respect to the notations, cf. 12.1)

$F(x) = O(|x|^{-k})$ when $x \to -\infty$,

$1 - F(x) = O(x^{-k})$ when $x \to +\infty$,

then any moment of order $\nu < k$ exists. In order to prove this, it is according to
7.2 sufficient to show that the integral of $|x|^\nu$ with respect to $F(x)$ over an interval
$(a, b)$ is less than a constant independent of $a$ and $b$. Now we have by hypothesis

$\int_{2^{r-1}}^{2^r} |x|^\nu\, dF(x) \le 2^{r\nu} \left( F(2^r) - F(2^{r-1}) \right) \le 2^{r\nu} \left( 1 - F(2^{r-1}) \right) < \frac{C}{2^{r(k - \nu)}},$

where $C$ is independent of $r$, and a similar relation for the integral over $(-2^r, -2^{r-1})$.
Summing over $r = 1, 2, \ldots$, and adding the integral over $(-1, 1)$, which is $\le 1$, we
find for any interval $(a, b)$

$\int_a^b |x|^\nu\, dF(x) < 1 + 2C \sum_{r=1}^{\infty} 2^{-r(k - \nu)} = 1 + \frac{2C}{2^{k - \nu} - 1},$

and thus the $\nu$:th moment exists.
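
The dyadic argument can be traced on a concrete case. The Python sketch below assumes, purely for illustration, the distribution function $F(x) = 1 - x^{-k}$ for $x \ge 1$ (and 0 for $x < 1$) with $k = 3$, and the order $\nu = 2 < k$. For this choice the integral of $x^\nu$ over each dyadic interval is known in closed form and can be compared with the bound $2^{r\nu}(1 - F(2^{r-1}))$ used above.

```python
k, nu = 3.0, 2.0       # hypothetical tail exponent and moment order, nu < k

def tail(x):
    """1 - F(x) for F(x) = 1 - x^(-k) when x >= 1, and F(x) = 0 when x < 1."""
    return x ** (-k) if x >= 1 else 1.0

def dyadic_piece(r):
    """Exact integral of x^nu dF(x) over 2^(r-1) < x <= 2^r, using the
    density k * x^(-k-1) of F on x > 1."""
    lo, hi = 2.0 ** (r - 1), 2.0 ** r
    return k / (k - nu) * (lo ** (nu - k) - hi ** (nu - k))

def dyadic_bound(r):
    """The bound 2^(r*nu) * (1 - F(2^(r-1))) for the same integral."""
    return 2.0 ** (r * nu) * tail(2.0 ** (r - 1))

pieces = [dyadic_piece(r) for r in range(1, 60)]
bounds = [dyadic_bound(r) for r in range(1, 60)]

moment = sum(pieces)              # approaches k / (k - nu) = 3
bound_total = sum(bounds)         # geometric series, stays finite since nu < k
```

Each bound decreases like $2^{-r(k - \nu)}$, so the series of bounds converges, and the sum of the exact pieces approaches the value $k/(k - \nu)$ of the moment.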

7.5. The Riemann-Stieltjes integral. Consider the Lebesgue-Stieltjes
integral

(7.5.1)  $\int_I g(x)\, dF(x)$

in the particular case when $I$ is a finite half-open interval

$I = (a < x \le b),$

while $g(x)$ is continuous in $I$ and tends to a finite limit as $x \to a + 0$.

We divide $I$ in $n$ sub-intervals $i_\nu = (x_{\nu-1} < x \le x_\nu)$ by means of
the points

$a = x_0 < x_1 < \cdots < x_n = b$

and consider the Darboux sums (7.1.1) which correspond to the division
$I = i_1 + \cdots + i_n$. We then obtain

(7.5.2)  $Z = \sum_{\nu=1}^{n} M_\nu \left[ F(x_\nu) - F(x_{\nu-1}) \right], \qquad z = \sum_{\nu=1}^{n} m_\nu \left[ F(x_\nu) - F(x_{\nu-1}) \right],$

$m_\nu$ and $M_\nu$ being the lower and upper bounds of $g(x)$ in $i_\nu$. Now
let $\varepsilon > 0$ be given. By hypothesis we can then find $\delta$ such that
$M_\nu - m_\nu < \varepsilon$ as soon as $x_\nu - x_{\nu-1} < \delta$. Choosing $n$ and the $x_\nu$ such
that $x_\nu - x_{\nu-1} < \delta$ for all $\nu$, we then have

$Z - z \le \varepsilon \left[ F(b) - F(a) \right].$

Thus when $n$ tends to infinity, and at the same time the maximum
length of the sub-intervals $i_\nu$ tends to zero, $Z$ and $z$ tend to a common
limit which must be equal to the integral (7.5.1):

(7.5.3)  $\lim_{n \to \infty} Z = \lim_{n \to \infty} z = \int_a^b g(x)\, dF(x).$

Thus in the particular case here considered the simple expression
(7.5.2) of the Darboux sums is sufficient to determine the value of
the Lebesgue-Stieltjes integral. If we put $F(x) = x$, these expressions
become identical with the Darboux sums considered in the theory of
the ordinary Riemann integral. Accordingly, the integral defined by
(7.5.3) is called a Riemann-Stieltjes integral. It follows from the above
that, when this integral exists, it always has the same value as the
corresponding Lebesgue-Stieltjes integral.

If, in every sub-interval $i_\nu$, we take an arbitrary point $\xi_\nu$, we
obviously have

(7.5.4)  $\lim_{n \to \infty} \sum_{\nu=1}^{n} g(\xi_\nu) \left[ F(x_\nu) - F(x_{\nu-1}) \right] = \int_a^b g(x)\, dF(x),$

since the sum in the first member is included between $z$ and $Z$.



The Riemann-Stieltjes integral (7.5.3) exists even in the more
general case when $g(x)$ is bounded in $(a, b)$ and has at most a finite
number of discontinuity points, provided that $F(x)$ is continuous in
every one of these points. We can, in fact, then surround each such point
by a sub-interval which gives an arbitrarily small contribution to the sums $z$ and $Z$.

In the particular case when $F(x)$ is continuous everywhere in $(a, b)$
and has a continuous derivative $F'(x)$, except at most in a finite
number of points, we have for every $i_\nu$ not containing any of the
exceptional points

$F(x_\nu) - F(x_{\nu-1}) = (x_\nu - x_{\nu-1})\, F'(\xi_\nu),$

where $\xi_\nu$ is a point belonging to $i_\nu$. By means of (7.5.4) it follows
that in this case the integral (7.5.3) reduces to an ordinary Riemann
integral:

(7.5.5)  $\int_a^b g(x)\, dF(x) = \int_a^b g(x)\, F'(x)\, dx.$
a a 

All these properties immediately extend themselves to the case of
a complex-valued function $g(x)$, and also to infinite intervals $(a, b)$
subject to the condition that $g(x)$ is integrable over $(a, b)$ with respect
to $F(x)$. If this condition is satisfied, we have e.g. the following
generalization of (7.5.4)

(7.5.6)  $\lim_{n \to \infty} \sum_{\nu=1}^{n} g(\xi_\nu) \left[ F(x_\nu) - F(x_{\nu-1}) \right] = \int_{-\infty}^{\infty} g(x)\, dF(x),$

where as before the maximum length of the sub-intervals $(x_{\nu-1}, x_\nu)$
tends to zero as $n \to \infty$, while at the same time $x_0 \to -\infty$ and
$x_n \to +\infty$.

Suppose now that two non-decreasing functions $F(x)$ and $G(x)$
are given, which are both continuous in the closed interval $(a, b)$,
except at most for a finite number of discontinuity points, which are
all inner points of $(a, b)$. We further suppose that no point in $(a, b)$
is a discontinuity point for both functions $F$ and $G$. Choosing the
sub-intervals so that no $x_\nu$ is a discontinuity point, we then have

$F(b) G(b) - F(a) G(a) = \sum_{\nu=1}^{n} \left[ F(x_\nu) G(x_\nu) - F(x_{\nu-1}) G(x_{\nu-1}) \right]$
$\qquad = \sum_{\nu=1}^{n} F(x_\nu) \left[ G(x_\nu) - G(x_{\nu-1}) \right] + \sum_{\nu=1}^{n} G(x_{\nu-1}) \left[ F(x_\nu) - F(x_{\nu-1}) \right].$

The two terms in the last expression are included between the lower
and upper Darboux sums corresponding to the integrals $\int F\, dG$ and
$\int G\, dF$ respectively. Passing to the limit, we thus obtain the formula
of partial integration:

(7.5.7)  $\int_a^b d(FG) = \int_a^b F\, dG + \int_a^b G\, dF.$
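
The formula of partial integration can be checked on a case where one of the functions has a jump. In the Python sketch below the functions $F(x) = x^2$ and $G(x) = x + \varepsilon(x - 0.4)$ on $(0, 1)$ are hypothetical, and the number of sub-intervals is chosen so that no division point falls on the discontinuity point $x = 0.4$ of $G$. The integrals are approximated by sums of the type (7.5.4).

```python
def rs_sum(g, F, a, b, n):
    """Sum of g(xi_v) * [F(x_v) - F(x_{v-1})] with xi_v taken as the mid-point."""
    total = 0.0
    for v in range(1, n + 1):
        x_prev = a + (b - a) * (v - 1) / n
        x_v = a + (b - a) * v / n
        total += g(0.5 * (x_prev + x_v)) * (F(x_v) - F(x_prev))
    return total

F = lambda x: x * x                                  # continuous everywhere
G = lambda x: x + (1.0 if x >= 0.4 else 0.0)         # one jump, at an inner point

a, b, n = 0.0, 1.0, 1001     # 0.4 * 1001 is not an integer: no x_v equals 0.4

int_F_dG = rs_sum(F, G, a, b, n)     # close to 1/3 + F(0.4) = 1/3 + 0.16
int_G_dF = rs_sum(G, F, a, b, n)     # close to 2/3 + (1 - 0.16)
boundary = F(b) * G(b) - F(a) * G(a) # integral of d(FG) over (a, b)
```

The jump of $G$ contributes $F(0.4)$ to $\int F\, dG$, and the two integrals together reproduce $F(b)G(b) - F(a)G(a) = 2$ up to the discretization error.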

Finally, we consider a sequence of distribution functions (cf. 6.7)
$F_1(x), F_2(x), \ldots$ which converge to a non-decreasing function $F(x)$
in every continuity point of the latter. (By 6.7, the limit $F(x)$ is not
necessarily a distribution function.) Let $g(x)$ be everywhere continuous.
For any finite interval $(a, b)$ such that $a$ and $b$ are continuity points
of $F(x)$, an inspection of the Darboux sums that determine the integrals
then shows that we have

(7.5.8)  $\lim_{n \to \infty} \int_a^b g(x)\, dF_n(x) = \int_a^b g(x)\, dF(x).$

Suppose further that, to any $\varepsilon > 0$, we can find $A$ such that

$\int_{-\infty}^{-A} |g(x)|\, dF_n(x) + \int_A^{\infty} |g(x)|\, dF_n(x) < \varepsilon$

for $n = 1, 2, \ldots$. We may then always choose $A$ such that $F(x)$ is
continuous for $x = \pm A$, and by means of (7.5.8) we find that

$\int_A^B |g(x)|\, dF_n(x) \to \int_A^B |g(x)|\, dF(x),$

where $B > A$ is another continuity point of $F(x)$. Thus the last
integral is $\le \varepsilon$ for any $B > A$, and for the integral over $(-B, -A)$
there is a corresponding relation. It follows that $g(x)$ is integrable
over $(-\infty, \infty)$ with respect to $F(x)$. If, in (7.5.8), we take $a = -A$
and $b = +A$, each integral will differ by at most $2\varepsilon$ from the
corresponding integral over $(-\infty, \infty)$. Since $\varepsilon$ is arbitrary, we then have

(7.5.9)  $\lim_{n \to \infty} \int_{-\infty}^{\infty} g(x)\, dF_n(x) = \int_{-\infty}^{\infty} g(x)\, dF(x).$

This relation is immediately extended to complex-valued functions $g(x)$.
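
The role of the condition on the tails can be seen on the two sequences of one-point distributions considered in 6.7. In the Python sketch below the integrands are hypothetical choices; since each $F_n$ places its whole mass in a single point, the integral of $g$ with respect to $F_n$ is simply the value of $g$ in that point, by (7.1.8).

```python
import math

def integral_point_mass(g, location):
    """Integral of g with respect to eps(x - location), i.e. with respect to
    a unit mass placed at `location`; by (7.1.8) this equals g(location)."""
    return g(location)

ns = (1, 10, 100, 1000)

# F_n(x) = eps(x - 1/n): the tail condition holds for any bounded g,
# and the integrals converge to the integral with respect to eps(x).
g = math.cos
near = [integral_point_mass(g, 1.0 / n) for n in ns]
limit_near = integral_point_mass(g, 0.0)

# F_n(x) = eps(x - n): the limit F(x) is identically zero, but for g(x) = 1
# the integral over (A, infinity) equals 1 whenever n > A, so the tail
# condition fails and (7.5.9) does not hold.
one = lambda x: 1.0
far = [integral_point_mass(one, float(n)) for n in ns]
limit_far = 0.0            # integral of g with respect to F = 0
```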


References to chapters 4-7. The classical theory of integration received
its final form in a famous paper by Riemann (1854). About 1900, the theory of the
measure of sets of points was founded by Borel and Lebesgue, and the latter
introduced the concept of integral which bears his name. The integral with respect to a
non-decreasing function $F(x)$ had been considered already in 1894 by Stieltjes, and
in 1913 Radon (Ref. 206) investigated the general properties of additive set functions,
and the theory of integration with respect to such functions.

There are a great number of treatises on modern integration theory. The reader
is particularly referred to the books of Lebesgue himself (Ref. 23), de la Vallée Poussin
(Ref. 40) and Saks (Ref. 23). De la Vallée Poussin gives an excellent introduction
to the theory of the Lebesgue integral, and contains also some chapters on additive
set functions, while the two other books go deeper into the more difficult parts of
the theory.



Chapters 8-9. Theory of Measure and Integration in $R_n$.


CHAPTER 8.

Lebesgue Measure and other Additive Set Functions in $R_n$.

8.1. Lebesgue measure in $R_n$. The elementary measure of
extension of a one-dimensional interval is the length of the interval.
The corresponding measure for a two-dimensional interval (cf 3.1) is
the area, and for a three-dimensional interval the volume of the interval.

Generally, if i denotes the finite n-dimensional interval defined by
the inequalities

$a_\nu \le x_\nu \le b_\nu \quad (\nu = 1, 2, \ldots, n),$

we shall define the n-dimensional volume of the interval i as the
non-negative quantity

$L(i) = \prod_{\nu=1}^{n} (b_\nu - a_\nu).$

For an open or half-open interval with the same extremes $a_\nu$ and $b_\nu$,
the volume will be the same as in the case of the closed interval.
A degenerate interval has always the volume zero. For an infinite
non-degenerate interval, we put $L(i) = +\infty$.

The Borel lemma (cf 4.1) is directly extended to n dimensions, and
by an easy generalization of the proof of (4.1.1) we find that L(i) is
an additive function of the interval.
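
The definition of L(i) can be sketched in a few lines of Python. The function name `interval_volume` and the representation of the extremes as two lists are assumptions made for this illustration only; they do not come from the text.

```python
import math

def interval_volume(a, b):
    """n-dimensional volume L(i) of the interval a[v] <= x[v] <= b[v].

    Hypothetical helper: `a` and `b` hold the extremes a_v and b_v.
    """
    if len(a) != len(b):
        raise ValueError("a and b must have the same number of coordinates")
    sides = [bv - av for av, bv in zip(a, b)]
    if any(s < 0 for s in sides):
        raise ValueError("each b[v] must be at least a[v]")
    if any(s == 0 for s in sides):
        return 0.0            # a degenerate interval always has volume zero
    if any(math.isinf(s) for s in sides):
        return math.inf       # infinite non-degenerate interval
    return math.prod(sides)   # product of the side lengths b_v - a_v

# Additivity on a simple split of the rectangle [0,2] x [0,3] at x_1 = 1.
whole = interval_volume([0, 0], [2, 3])
parts = interval_volume([0, 0], [1, 3]) + interval_volume([1, 0], [2, 3])
```

Here `whole` and `parts` agree, which is the additive property in the simplest case; the common face of the two pieces is a degenerate interval and contributes nothing.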

In the same way as in 4.2, we now ask if a measure with the
same fundamental properties as L(i) can be defined even for a more
general class of sets than intervals. We thus want to find a
non-negative and additive set function L(S), defined for all Borel sets S
in $R_n$, and taking the value L(i) as soon as S is an interval i. In
4.3-4.7, we have given a detailed treatment of this problem in the
case n = 1, and we have seen that there is a unique solution, viz.
the Lebesgue measure in $R_1$. The case of a general n requires no
modification whatever. Every word and every formula of 4.3-4.7
hold true, if linear sets are throughout replaced by n-dimensional
ones, and the length of a linear interval is replaced by the
n-dimensional volume.

It thus follows that there is a non-negative and additive set function
L(S), uniquely defined for all Borel sets S in $R_n$ and such that, in the
particular case when S is an interval, L(S) is equal to the n-dimensional
volume of the interval. L(S) is called the n-dimensional Lebesgue measure 1)
of S.

8.2. Non-negative additive set functions in $R_n$. In the same
way as in the one-dimensional case, we may also for n > 1 consider
non-negative and additive set functions P(S) of a more general kind
than the n-dimensional Lebesgue measure L(S).

We shall consider set functions P(S) defined for all Borel sets S
in $R_n$ and satisfying the conditions A)-C) of 6.2. It is immediately
seen that these conditions do not contain any reference to the number
of dimensions. The relations (6.2.1)-(6.2.3) then obviously hold for
any number of dimensions.

With any set function P(S) of this type we may associate a point
function $F(\mathbf{x}) = F(x_1, \ldots, x_n)$, in a similar way as shown by (6.2.4)
for the one-dimensional case. The direct generalization of (6.2.4) is,
however, somewhat cumbersome for a general n, and we shall content
ourselves to develop the formulae for the particular case of a bounded
P(S), where the definition of the associated point function may be
simplified in the way shown for the one-dimensional case by (6.5.1).
This will be done in the following paragraph.

As in the case n = 1, any non-negative and additive set function
P(S) in $R_n$ defines an n-dimensional P-measure of the set S, which
constitutes a generalization of the n-dimensional Lebesgue measure
L(S). The remarks of 6.4 on sets of P-measure zero apply to sets in
any number of dimensions.


1) In order to be quite precise, we ought to adopt a notation showing explicitly
the number of dimensions, e.g. by writing $L_n(S)$ instead of L(S). There should,
however, be no risk of misunderstanding, if it is always borne in mind that the
measure of a given point set is relative to the space in which it is considered. Thus
if we consider e.g. the interval (0, 1) on a straight line as a set of points in $R_1$, its
(one-dimensional) measure has the value 1. If, on the other hand, we take the line
as x-axis in a plane, and consider the same interval as a set of points in $R_2$, we are
concerned with a degenerate interval, the (two-dimensional) measure of which is
equal to zero.



8.3. Bounded set functions. When $P(R_n)$ is finite, we shall say
(cf 6.5) that P(S) is bounded. We have then always $P(S) \le P(R_n)$.
For a bounded P(S) we define, in generalization of (6.5.1):

(8.3.1) $F(\mathbf{x}) = F(x_1, \ldots, x_n) = P(\xi_1 \le x_1, \ldots, \xi_n \le x_n).$

Evidently $F(\mathbf{x})$ is, in each variable $x_\nu$, a non-decreasing function which
is everywhere continuous to the right, and we have for all $\mathbf{x}$ (cf 6.5.2)

$0 \le F(\mathbf{x}) \le P(R_n).$

In the one-dimensional case, the value of P(S) for a half-open
interval $i_1$ defined by $a < x \le a + h$ is, by (6.2.5), given by a first
order difference of F(x):

$P(i_1) = \Delta_1 F(a) = F(a + h) - F(a).$

This formula may be generalized to the case of an arbitrary n.
Consider first a set function P(S) in $R_2$ and a two-dimensional interval
$i_2$ defined by $a_1 < x_1 \le a_1 + h_1$, $a_2 < x_2 \le a_2 + h_2$. We then have

(8.3.2) $P(i_2) = \Delta_2 F(a_1, a_2) = F(a_1 + h_1, a_2 + h_2) - F(a_1, a_2 + h_2) - F(a_1 + h_1, a_2) + F(a_1, a_2).$

This will be clear from Fig. 2. If $M_1, \ldots, M_4$ are the values assumed
by P(S) for each of the rectangular domains indicated in the figure,
the additive property of P(S) gives

$M_4 = (M_1 + M_2 + M_3 + M_4) - (M_1 + M_2) - (M_1 + M_3) + M_1,$

and according to the definition (8.3.1) of $F(\mathbf{x})$, this is identical with
(8.3.2).



Fig. 2. Set functions and point functions in $R_2$.


The generalization to an arbitrary n is immediate. If P(S) is a
set function in $R_n$, and if $i_n$ is the half-open interval defined by
$a_\nu < x_\nu \le a_\nu + h_\nu$ for $\nu = 1, 2, \ldots, n$, we have

(8.3.3)
$\begin{aligned}
P(i_n) = \Delta_n F(a_1, \ldots, a_n) ={}& F(a_1 + h_1, \ldots, a_n + h_n) \\
& - F(a_1, a_2 + h_2, \ldots, a_n + h_n) - \cdots - F(a_1 + h_1, \ldots, a_{n-1} + h_{n-1}, a_n) \\
& + \cdots \\
& + (-1)^n F(a_1, \ldots, a_n).
\end{aligned}$
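
The alternating sum in (8.3.3) runs over all $2^n$ corners of the interval, with a sign determined by the number of lower extremes used. A minimal Python sketch follows; the names `nth_difference`, `clamp01` and `F_uniform_square` are hypothetical, and the uniform distribution over the unit square is an assumed example, not one taken from the text.

```python
from itertools import product

def nth_difference(F, a, h):
    """Delta_n F(a): mass of the half-open interval a[v] < x[v] <= a[v] + h[v]."""
    n = len(a)
    total = 0.0
    for choice in product((0, 1), repeat=n):
        # choice[v] == 1 takes the upper extreme a[v] + h[v], 0 the lower extreme a[v]
        corner = [a[v] + h[v] if choice[v] else a[v] for v in range(n)]
        lower_count = n - sum(choice)
        total += (-1) ** lower_count * F(*corner)
    return total

def clamp01(t):
    """Restrict t to the closed interval [0, 1]."""
    return min(max(t, 0.0), 1.0)

def F_uniform_square(x1, x2):
    """Assumed example: unit mass spread uniformly over the unit square."""
    return clamp01(x1) * clamp01(x2)

# Mass of the rectangle 0.2 < x1 <= 0.5, 0.1 < x2 <= 0.5, i.e. its area.
mass = nth_difference(F_uniform_square, [0.2, 0.1], [0.3, 0.4])
```

For n = 2 the loop reproduces the four terms of (8.3.2), and for n = 1 it reduces to F(a + h) - F(a).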

To any bounded P(S) in $R_n$, there thus corresponds a point function
$F(x_1, \ldots, x_n)$ which, in each $x_\nu$, is non-decreasing and continuous to
the right, and is such that the n:th difference $\Delta_n F$ as defined by (8.3.3)
is always non-negative. Conversely, a generalization of the argument
of 6.3 shows that any given F with these properties uniquely
determines a set function P(S) satisfying the conditions A)-C) of
6.2, which for any interval $i_n$ assumes the value given by (8.3.3).

When one of the variables in F, say $x_\nu$, tends to $-\infty$, while all
the others remain fixed, it is shown as in 6.5 that F tends to zero.
Similarly, when all the $x_\nu$ tend simultaneously to $+\infty$, F tends to
$P(R_n)$.

When all the variables in F except one, say $x_\nu$, tend to $+\infty$,
F will tend to a limit, which is a bounded non-decreasing function
$F_\nu(x_\nu)$ of the remaining variable $x_\nu$. By 6.2, the function $F_\nu(x_\nu)$ has
at most an enumerable number of discontinuity points $z'_\nu, z''_\nu, \ldots$ Let
us consider these as excluded values for the variable $x_\nu$, which is thus
only allowed to assume values different from $z'_\nu, z''_\nu, \ldots$ In the same
way, each variable $x_1, \ldots, x_n$ has its own finite or enumerable set of
excluded values. For any non-excluded point $\mathbf{x} = (x_1, \ldots, x_n)$, the
function F is continuous. This follows from the inequality

$|F(\mathbf{x} + \mathbf{h}) - F(\mathbf{x})| \le F(\mathbf{x} + |\mathbf{h}|) - F(\mathbf{x} - |\mathbf{h}|) \le \sum_{\nu=1}^{n} \bigl(F_\nu(x_\nu + |h_\nu|) - F_\nu(x_\nu - |h_\nu|)\bigr),$

where $\mathbf{x} + \mathbf{h} = (x_1 + h_1, \ldots, x_n + h_n)$ is an arbitrary point, while $|\mathbf{h}|$
denotes here the point $(|h_1|, \ldots, |h_n|)$, and the sums and differences
$\mathbf{x} + \mathbf{h}$ etc. are formed according to the rules of vector addition
(cf 11.1-11.2). An inspection of Fig. 2 will help to make this inequality clear.

An n-dimensional interval such that none of the extremes $a_\nu$ and
$b_\nu$ is an excluded value for the corresponding variable $x_\nu$ is called a
continuity interval of P(S). The value assumed by P(S) when S is a
continuity interval will obviously change in a continuous way for
small variations in the $a_\nu$ and $b_\nu$. If two bounded set functions in
$R_n$ agree for all intervals that are continuity intervals for both, it
follows (cf 6.7) that the set functions are identical.

8.4. Distributions. Non-negative and additive set functions P(S)
such that $P(R_n) = 1$ play, like the corresponding one-dimensional
functions (cf 6.6), a fundamental part in the applications. By the preceding
paragraph, the point function $F(\mathbf{x})$ associated with a set function
P(S) of this class satisfies the relations

(8.4.1)
$F(\mathbf{x}) = F(x_1, \ldots, x_n) = P(\xi_1 \le x_1, \ldots, \xi_n \le x_n),$
$0 \le F(\mathbf{x}) \le 1, \qquad \Delta_n F \ge 0,$
$F(-\infty, x_2, \ldots, x_n) = \cdots = F(x_1, \ldots, x_{n-1}, -\infty) = 0,$
$F(+\infty, \ldots, +\infty) = 1.$

As in the one-dimensional case, the functions P(S) and $F(\mathbf{x})$ will
be interpreted by means of a distribution of a unit of mass over the
space $R_n$, such that every Borel set S carries the mass P(S). As in
6.6, we are at liberty to define the distribution either by the set
function P(S) or by the corresponding point function $F(\mathbf{x})$, which represents
the quantity of mass allotted to the infinite interval $\xi_1 \le x_1, \ldots, \xi_n \le x_n$.
The difference between these two equivalent modes of definition is,
of course, only formal, and it will be a matter of convenience to
decide which of them should be used in a given case. As in 6.6,
P(S) will be called the probability function, and $F(\mathbf{x})$ the distribution
function of the distribution.

Thus a distribution function is a function $F(\mathbf{x}) = F(x_1, \ldots, x_n)$
which, in each $x_\nu$, is non-decreasing and everywhere continuous to the
right, and is such that the n:th difference as defined by (8.3.3) is
always non-negative. Conversely, it follows from the preceding paragraph
that any given F with these properties is the distribution function
of a uniquely determined distribution in $R_n$.


If the set which consists of the single point $\mathbf{x} = \mathbf{a}$ carries a
positive quantity of mass, $\mathbf{a}$ is a discrete mass point of the distribution.
The set of all discrete mass points of a distribution is enumerable,
as we find by a direct generalization of the corresponding proof in
6.2. Obviously any discrete mass point $\mathbf{a}$ is a discontinuity point for
the distribution function F. In the case n = 1 we have seen in 6.6
that, conversely, F is continuous in all points x except the discrete
mass points. This is generally not true when n > 1. In fact, in a
multi-dimensional space the mass may be distributed on lines, surfaces
or hypersurfaces in such a way that there is no single point carrying
a positive quantity of mass, while still F may be discontinuous in
certain points. In the preceding paragraph we have, however, seen
that it is possible to exclude certain values for each variable $x_\nu$ so
that the function F will be continuous in all "non-excluded" points.

Consider e.g. a distribution of a mass unit with uniform density over the
interval (0, 1) of the $x_2$-axis in the plane of the variables $x_1, x_2$. Obviously this
distribution has no discrete mass points, and still the corresponding distribution function
$F(x_1, x_2)$ is discontinuous in every point $(0, x_2)$ with $x_2 > 0$. Accordingly it will be
seen that the function $F_1(x_1) = \lim_{x_2 \to +\infty} F(x_1, x_2)$ discussed in the preceding paragraph
is here discontinuous for $x_1 = 0$, which is the only "excluded" value for $x_1$. For $x_2$
there are no excluded values, and accordingly $F(x_1, x_2)$ is continuous in any point
$(x_1, x_2)$ with $x_1 \ne 0$.
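
The example can be checked numerically. In the sketch below the closed form of F is written out from the description of the distribution; the function names `F` and `box_mass` are hypothetical and serve this illustration only.

```python
def F(x1, x2):
    """Distribution function of a unit mass spread uniformly over
    the segment 0 < x2 < 1 of the line x1 = 0."""
    if x1 < 0:
        return 0.0                      # no mass lies to the left of the x2-axis
    return min(max(x2, 0.0), 1.0)       # mass of the part of the segment below x2

def box_mass(x2, e):
    """Mass of the small square -e < xi1 <= e, x2 - e < xi2 <= x2 + e,
    computed as the second-order difference (8.3.2)."""
    return F(e, x2 + e) - F(-e, x2 + e) - F(e, x2 - e) + F(-e, x2 - e)

jump = F(0.0, 0.5) - F(-1e-9, 0.5)      # jump of F across the line x1 = 0
small = box_mass(0.5, 0.01)             # shrinks with e: no point mass at (0, 0.5)
```

The jump of F across $x_1 = 0$ stays equal to $x_2$ however close the two points are, while the mass of the small square around $(0, x_2)$ is 2e and tends to zero, so the discontinuity is not caused by a discrete mass point.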

We further see that any distribution in $R_n$ can be uniquely
represented in the form (6.6.2), as the sum of two components, the first
of which corresponds to a distribution with its whole mass
concentrated in discrete mass points, while the second component corresponds
to a distribution without discrete mass points. It follows from the
above that, when n > 1, we cannot assert that the distribution
function of the second component is everywhere continuous.

Let I denote the n-dimensional interval defined by

$x_\nu - h_\nu < \xi_\nu \le x_\nu + h_\nu$

for $\nu = 1, 2, \ldots, n$. The ratio

$\dfrac{P(I)}{L(I)} = \dfrac{\Delta_n F}{2^n h_1 h_2 \cdots h_n},$

where the difference $\Delta_n F$ is defined as in (8.3.3), represents the average
density of the mass in the interval I. If the partial derivative

$f(x_1, \ldots, x_n) = \dfrac{\partial^n F}{\partial x_1 \cdots \partial x_n}$

exists, the average density will tend to this value as all the $h_\nu$ tend
to zero, and accordingly $f(x_1, \ldots, x_n)$ represents the density of mass
at the point $\mathbf{x}$. As in the one-dimensional case, this function will be
called the probability density or the frequency function of the distribution.

Let $F(x_1, \ldots, x_n)$ be the distribution function of a given distribution.
When all the variables except $x_\nu$ tend to $+\infty$, F will (cf 8.3) tend to
a limit $F_\nu(x_\nu)$ which is a distribution function in $x_\nu$. We have, e.g.,
$F_1(x_1) = F(x_1, +\infty, \ldots, +\infty)$. The function $F_\nu(x_\nu)$ defines a
one-dimensional distribution, which will be called the marginal distribution
of $x_\nu$. We may obtain a concrete representation of this marginal
distribution by allowing every mass particle in the original
n-dimensional distribution to move in a direction perpendicular to the axis
of $x_\nu$, until it arrives at a point of this axis. When, finally, the whole
mass is in this way projected on the axis of $x_\nu$, a one-dimensional
distribution is generated on the axis, and this is the marginal
distribution of $x_\nu$. Each variable $x_\nu$ has, of course, its own marginal
distribution, that may be different from the marginal distributions of the
other variables.

Let us now take any group of k < n variables, say $x_1, \ldots, x_k$, and
allow the n - k remaining variables to tend to $+\infty$. Then F will
tend to a distribution function in $x_1, \ldots, x_k$, which defines the
k-dimensional marginal distribution of this group of variables. The
distribution may be concretely represented by a projection of the mass in
the original n-dimensional distribution on the k-dimensional subspace
(cf 3.5) of the variables $x_1, \ldots, x_k$. Let P be the probability
function of the n-dimensional distribution, while $P_{1,\ldots,k}$ is the probability
function of the marginal distribution of $x_1, \ldots, x_k$. Let, further, S'
denote any set in the k-dimensional subspace of $x_1, \ldots, x_k$, while S
is the cylinder set (cf 3.5) of all points $\mathbf{x}$ in $R_n$ that are projected on
the subspace in a point belonging to S'. Obviously we then have

(8.4.2) $P_{1,\ldots,k}(S') = P(S),$

which is the analytical expression of the projection of the mass in
the original n-dimensional distribution on the k-dimensional subspace
of the variables $x_1, \ldots, x_k$.
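
For a distribution with finitely many mass points, the projection described by (8.4.2) amounts to adding up the masses that share the same retained coordinates. The following Python sketch assumes such a discrete distribution stored as a dictionary from points to masses; the name `marginal` and the sample values are hypothetical.

```python
from collections import defaultdict

def marginal(joint, keep):
    """Project the mass of `joint` on the subspace of the coordinates in `keep`."""
    out = defaultdict(float)
    for point, mass in joint.items():
        out[tuple(point[i] for i in keep)] += mass
    return dict(out)

# Assumed example: four mass points in R_3 with total mass 1.
joint = {(0, 0, 0): 0.1, (0, 1, 0): 0.2, (1, 0, 1): 0.3, (1, 1, 1): 0.4}

m1 = marginal(joint, keep=(0,))          # marginal distribution of x_1

# (8.4.2) with S' = {x_1 = 1}: the cylinder set S is all points whose first
# coordinate equals 1, whatever the remaining coordinates are.
cylinder_mass = sum(mass for point, mass in joint.items() if point[0] == 1)
```

Here `m1[(1,)]` and `cylinder_mass` are the two members of (8.4.2) for this particular choice of S'.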

The theory of distributions in $R_n$ will be further developed in
Chs. 21-24.

8.5. Sequences of distributions. As in the one-dimensional case
(cf 6.7), we shall say that a sequence of distributions in $R_n$ is
convergent, when the corresponding probability functions converge to a
non-negative and additive set function P(S), in every continuity
interval of the latter. If, in addition, the limit P(S) is a probability
function, i.e. if $P(R_n) = 1$, we shall say that the sequence converges to a
distribution. From the point of view of the applications, it is generally
only the latter mode of convergence that is important.

For a sequence which is convergent without converging to a distribution, we
have $P(R_n) < 1$, which may be interpreted (cf the example discussed in 6.7) by saying
that a certain part of the mass in our distributions "escapes towards infinity" when
we pass to the limit.
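
A one-dimensional illustration of escaping mass, given here as an assumed example (it is not claimed to be the example of 6.7): the n:th distribution places mass 1/2 in the point 0 and mass 1/2 in the point n. The function names are hypothetical.

```python
def F_n(x, n):
    """Distribution function with mass 1/2 at the point 0 and mass 1/2 at the point n."""
    return 0.5 * (x >= 0) + 0.5 * (x >= n)

def F_limit(x):
    """Pointwise limit of F_n as n grows: only the mass at 0 remains."""
    return 0.5 * (x >= 0)

# For a fixed x, F_n(x) equals F_limit(x) as soon as n exceeds x.
values = [F_n(5.0, n) for n in (1, 3, 10, 100)]
```

Every `F_n` is a distribution function with total mass 1, but the limit never exceeds 1/2: half of the mass has moved off to infinity, so the sequence is convergent without converging to a distribution.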

A straightforward generalization of 6.7 will show that a sequence
of distributions converges to a distribution when and only when
the corresponding distribution functions $F_1, F_2, \ldots$ tend to a
distribution function F in all "non-excluded" (cf 8.3) points of the latter.
A further criterion for deciding whether a given sequence of
distributions converges to a distribution or not will be given in 10.7.

As in 6.8, we shall further say that a sequence of distribution
functions $F_1, F_2, \ldots$ is convergent, if there is a function F,
non-decreasing in each $x_\nu$, such that $F_n \to F$ in every "non-excluded" point
of F. We then always have $0 \le F \le 1$, but according to the above
F is not necessarily a distribution function. We then have the
following generalization of the proposition proved in 6.8 for the
one-dimensional case: Every sequence of distribution functions contains a
convergent subsequence. This may be proved by a fairly
straightforward generalization of the proof in 6.8, and we shall not give the
proof here.

8.6. Distributions in a product space. Consider two spaces $R_m$
and $R_n$, with the variable points $\mathbf{x} = (x_1, \ldots, x_m)$ and $\mathbf{y} = (y_1, \ldots, y_n)$
respectively. Suppose that in each space a distribution is given, and
let $P_1$ and $F_1$ denote the probability function and the distribution
function of the distribution in $R_m$, while $P_2$ and $F_2$ have the analogous
significance for the distribution in $R_n$.

In the product space (cf 3.5) $R_m \cdot R_n$ of m + n dimensions, we denote
the variable point by $\mathbf{z} = (\mathbf{x}, \mathbf{y}) = (x_1, \ldots, x_m, y_1, \ldots, y_n)$. If $S_1$ and $S_2$
are sets in $R_m$ and $R_n$ respectively, we denote by S the rectangle set
(cf 3.5) of all points $\mathbf{z} = (\mathbf{x}, \mathbf{y})$ in the product space such that $\mathbf{x} \in S_1$
and $\mathbf{y} \in S_2$.

It is almost evident that we can always find an infinite number
of distributions in the product space, such that for each of them the
marginal distributions (cf 8.4) corresponding to the subspaces $R_m$ and
$R_n$ coincide with the two given distributions in these spaces. Among
these distributions in the product space we shall particularly note
one, which is of special importance for the applications. This is the
distribution given by the following theorem.

There is one and only one distribution in the product space $R_m \cdot R_n$
such that

(8.6.1) $P(S) = P_1(S_1)\,P_2(S_2)$

for all rectangle sets S defined by the relations $\mathbf{x} \in S_1$ and $\mathbf{y} \in S_2$. This
is the distribution defined by the distribution function

(8.6.2) $F(\mathbf{z}) = F_1(\mathbf{x})\,F_2(\mathbf{y})$

for all points $\mathbf{z} = (\mathbf{x}, \mathbf{y})$.

We first observe that $F(\mathbf{z})$ as given by (8.6.2) is certainly a
distribution function in $R_m \cdot R_n$, since it satisfies the characteristic properties
of a distribution function given in 8.4. Consider now the distribution
defined by $F(\mathbf{z})$. By means of (8.3.3) it follows that we have

$P(I) = P_1(I_1)\,P_2(I_2)$

for any half-open interval $I = (I_1, I_2)$ defined by inequalities of the
type $a_\mu < x_\mu \le b_\mu$, $c_\nu < y_\nu \le d_\nu$. Now any Borel set $S_1$ may be formed
from intervals $I_1$ by repetitions of the operations of addition and
subtraction. (By (1.3.1), the operation of multiplication may be reduced
to additions and subtractions.) By the additive property of $P_1$, it
follows that for any rectangle set of the form $S = (S_1, I_2)$ we have

$P(S) = P_1(S_1)\,P_2(I_2),$

and finally we obtain (8.6.1) by operating in the same way on
intervals $I_2$. On the other hand, any distribution satisfying (8.6.1) also
satisfies (8.6.2), the latter relation being, in fact, merely a particular
case of the former. Since a distribution is uniquely determined by
its distribution function, there can thus be only one distribution
satisfying (8.6.1).

If, in (8.6.1), we put $S_2 = R_n$, it follows from (8.4.2) that the
marginal distribution corresponding to the subspace $R_m$ coincides with the
given distribution in this space, with the probability function $P_1$.
Similarly, by putting $S_1 = R_m$, we find that the marginal distribution
in $R_n$ coincides with the given distribution in this space.
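
The theorem can be illustrated with two distributions that have finitely many mass points. The sketch below is an assumed example: the names `product_distribution` and `prob`, the dictionary representation and the numerical values are all hypothetical.

```python
from itertools import product

def product_distribution(p1, p2):
    """Distribution on the product space giving each point (x, y) the mass p1[x] * p2[y]."""
    return {(x, y): px * py for (x, px), (y, py) in product(p1.items(), p2.items())}

def prob(dist, event):
    """Mass carried by the set of points for which `event` is true."""
    return sum(mass for point, mass in dist.items() if event(point))

p1 = {0: 0.25, 1: 0.75}                  # given distribution in the first space
p2 = {"a": 0.5, "b": 0.3, "c": 0.2}      # given distribution in the second space
p = product_distribution(p1, p2)

S1, S2 = {1}, {"a", "b"}
lhs = prob(p, lambda z: z[0] in S1 and z[1] in S2)          # P(S) for the rectangle set
rhs = sum(p1[x] for x in S1) * sum(p2[y] for y in S2)       # P_1(S_1) P_2(S_2)

# Taking S2 as the whole second space recovers the given distribution p1.
marginal_1 = prob(p, lambda z: z[0] == 1)
```

In this finite case `lhs` and `rhs` agree up to rounding, which is (8.6.1), and `marginal_1` equals `p1[1]`, in line with the remark on marginal distributions.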


We finally remark that the theorem may be generalized to
distributions in the product space of any number of spaces. The proof is
quite similar to the above, and the relations (8.6.1) and (8.6.2) are
replaced by the obvious generalizations

$P = P_1 P_2 \cdots P_k$ and $F = F_1 F_2 \cdots F_k$.


CHAPTER 9. 

The Lebesgue-Stieltjes Integral for Functions
of n Variables.

9.1. The Lebesgue-Stieltjes integral. The theory of the
Lebesgue-Stieltjes integral for functions of one variable developed in Ch. 7 may
be directly generalized to functions of n variables. If, in the
expressions (7.1.1) of the Darboux sums, we allow P(S) to denote a
non-negative and additive set function in $R_n$, while $m_\nu$ and $M_\nu$ are the
lower and upper bounds of a given function $g(\mathbf{x}) = g(x_1, \ldots, x_n)$ in
the n-dimensional set $S_\nu$, the Lebesgue-Stieltjes integral

(9.1.1) $\int_S g(\mathbf{x})\,dP = \int_S g(x_1, \ldots, x_n)\,dP$

is defined in the same way as in the one-dimensional case.

The function $g(\mathbf{x})$ is said to be B-measurable in the set S if the
subset of all points $\mathbf{x}$ in S such that $g(\mathbf{x}) \le k$ is a Borel set for every
real value of k. All remarks on B-measurable functions given in 5.2
extend themselves without difficulty to functions of n variables.

If $g(\mathbf{x})$ is bounded and B-measurable in a set S of finite P-measure,
it is integrable over S with respect to P. The definitions of integral
and integrability in the case of an unbounded function $g(\mathbf{x})$, and a
set S of infinite P-measure, require only a straightforward
generalization of 7.2. All properties of the integral mentioned in 7.1-7.3
readily extend themselves to the case of n variables, all proofs being
strictly analogous to those given in the case n = 1.

In the particular case when P(S) is the n-dimensional Lebesgue
measure L(S), we obtain the Lebesgue integral of the function $g(\mathbf{x})$,
which is also often written in the ordinary multiple integral notation

$\int \cdots \int_S g(x_1, \ldots, x_n)\,dx_1 \cdots dx_n.$

If S is an interval, and $g(\mathbf{x})$ is integrable in the Riemann sense
over the interval, the Lebesgue integral coincides with the ordinary
multiple Riemann integral, as we have observed for the
one-dimensional case in 5.1.

9.2. Lebesgue-Stieltjes integrals with respect to a distribution.
The remarks made on this subject in 7.4 evidently apply also in the
case n > 1.

The moments of a distribution in $R_n$ are the integrals

$\alpha_{\nu_1 \ldots \nu_n} = \int_{R_n} x_1^{\nu_1} \cdots x_n^{\nu_n}\,dP,$

where the $\nu_i$ are non-negative integers. As in the one-dimensional
case, we shall say that the above moment exists, whenever the function
$x_1^{\nu_1} \cdots x_n^{\nu_n}$ is integrable over $R_n$ with respect to P.

We shall now consider the integral

(9.2.1) $\int_{R_n} g(x_1, \ldots, x_n)\,dP$

in the case when the function g only depends on a certain number
of the variables, say $x_1, \ldots, x_k$, where k < n. We denote by $R_k$
the k-dimensional subspace of these variables. Let us first assume g
bounded, and consider the divisions

$R_k = S'_1 + \cdots + S'_q,$
$R_n = S_1 + \cdots + S_q,$

where the $S'_\nu$ are Borel sets in $R_k$ such that $S'_\mu S'_\nu = 0$ for $\mu \ne \nu$,
while $S_\nu$ denotes the cylinder set (cf 3.5) in $R_n$ which has the base $S'_\nu$.
The upper Darboux sum

$Z = M_1 P(S_1) + \cdots + M_q P(S_q)$

corresponding to the integral (9.2.1) is then by (8.4.2) identical with
the sum

$Z = M_1 P_{1,\ldots,k}(S'_1) + \cdots + M_q P_{1,\ldots,k}(S'_q),$

where $P_{1,\ldots,k}$ denotes the probability function of the marginal
distribution of the variables $x_1, \ldots, x_k$. This is, however, the upper
Darboux sum corresponding to the k-dimensional integral

$\int_{R_k} g\,dP_{1,\ldots,k}.$

As the same relation holds for the lower Darboux sums, it follows
that we have for any bounded $g(x_1, \ldots, x_k)$

(9.2.2) $\int_{R_n} g(x_1, \ldots, x_k)\,dP = \int_{R_k} g(x_1, \ldots, x_k)\,dP_{1,\ldots,k},$

so that in this case the n-dimensional integral reduces to a
k-dimensional integral.

It is easily seen that the same relation holds whenever g is
integrable over $R_k$ with respect to $P_{1,\ldots,k}$, even if g is not bounded. We
may also assume g complex-valued.

9.3. A theorem on repeated integrals. If g(x, y) is continuous
in the rectangle $a \le x \le b$, $c \le y \le d$, we know that the relation

$\int_a^b \int_c^d g(x, y)\,dx\,dy = \int_a^b \Bigl(\int_c^d g(x, y)\,dy\Bigr) dx = \int_c^d \Bigl(\int_a^b g(x, y)\,dx\Bigr) dy$

holds, so that the double integral can be expressed in two ways as
a repeated integral. There is a corresponding theorem for the
Lebesgue-Stieltjes integral in any number of dimensions, and we shall
now prove this theorem in a certain special case.

Using the same notations as in 8.6, we consider two probability
functions $P_1$ and $P_2$ in the spaces $R_m$ and $R_n$ respectively, and the
uniquely determined probability function P in the product space
$R_m \cdot R_n$ which satisfies (8.6.1). Let $S_1$ and $S_2$ denote given sets in $R_m$
and $R_n$ respectively, while $S = (S_1, S_2)$ is the rectangle set in the
product space with the "sides" $S_1$ and $S_2$. Let further $g(\mathbf{x})$ and $h(\mathbf{y})$
be given point functions in $R_m$ and $R_n$ respectively, such that $g(\mathbf{x})$ is
integrable over $S_1$ with respect to $P_1$, while $h(\mathbf{y})$ is integrable over
$S_2$ with respect to $P_2$.

Then $g(\mathbf{x})h(\mathbf{y})$ is integrable over $S = (S_1, S_2)$ with respect to P, and
we have

(9.3.1) $\int_S g(\mathbf{x})h(\mathbf{y})\,dP = \int_{S_1} g(\mathbf{x})\,dP_1 \int_{S_2} h(\mathbf{y})\,dP_2.$

Suppose first that $g(\mathbf{x})$ and $h(\mathbf{y})$ are bounded and non-negative.
Consider the Darboux sums corresponding to the three integrals in
(9.3.1), and to the divisions $S_1 = S_1^{(1)} + \cdots + S_1^{(p)}$, $S_2 = S_2^{(1)} + \cdots + S_2^{(q)}$,
$S = \sum_{\mu,\nu} S^{(\mu\nu)}$, where $S^{(\mu\nu)}$ denotes the rectangle set $(S_1^{(\mu)}, S_2^{(\nu)})$. If these
sums are denoted by z and Z for the integral in the first member,
and by $z_1, Z_1$ and $z_2, Z_2$ for the two integrals in the second member,
it is seen that we have

$z_1 z_2 \le z \le Z \le Z_1 Z_2.$

By the definition of the integral, (9.3.1) then follows immediately.
Replacing further g and h by g' - g'' and h' - h'', where g', g'', h'
and h'' are bounded and non-negative, we obtain (9.3.1) for any real
and bounded g and h. The extension to any integrable and
complex-valued functions follows directly from the definition of the integral
for these classes of functions.
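
When both given distributions have finitely many mass points, the integrals in (9.3.1) are finite sums and the relation can be verified directly. The following Python sketch uses assumed sample distributions and functions; none of the names or values come from the text.

```python
# Assumed example: mass points and masses of the two given distributions.
p1 = {1.0: 0.2, 2.0: 0.5, 4.0: 0.3}
p2 = {-1.0: 0.6, 3.0: 0.4}

def g(x):
    return x * x

def h(y):
    return y + 1.0

# First member of (9.3.1): integral of g(x) h(y) with respect to the
# product distribution, which gives the point (x, y) the mass p1[x] * p2[y].
first_member = sum(g(x) * h(y) * px * py
                   for x, px in p1.items()
                   for y, py in p2.items())

# Second member: product of the two separate integrals.
integral_g = sum(g(x) * px for x, px in p1.items())
integral_h = sum(h(y) * py for y, py in p2.items())
second_member = integral_g * integral_h
```

Here $S_1$ and $S_2$ are taken as the whole spaces; restricting both sums to subsets of the mass points gives the relation for other rectangle sets.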

9.4. The Riemann-Stieltjes integral. The considerations of 7.5
may also be generalized to n variables, where we have to employ the
point function $F(x_1, \ldots, x_n)$ and the difference $\Delta_n F$ instead of the
point function F(x) and the difference $F(x_\nu) - F(x_{\nu-1})$.

In particular it follows that, if a continuous derivative $\dfrac{\partial^n F}{\partial x_1 \cdots \partial x_n}$
exists for all points of the interval I ($a_\nu \le x_\nu \le b_\nu$; $\nu = 1, \ldots, n$), and
if $g(\mathbf{x})$ is continuous in I, then the integral (9.1.1) may, for S = I,
be expressed as a multiple Riemann integral

$\int_I g(\mathbf{x})\,dP = \int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} g(x_1, \ldots, x_n)\,\dfrac{\partial^n F}{\partial x_1 \cdots \partial x_n}\,dx_1 \cdots dx_n.$

This property is immediately extended to the case of a complex-valued
function $g(\mathbf{x})$, and also to infinite intervals, subject to the condition
that $g(\mathbf{x})$ is integrable over I with respect to P.

9.5. The Schwarz inequality. Consider two real functions $g(\mathbf{x})$
and $h(\mathbf{x})$ such that the squares $g^2$ and $h^2$ are integrable with respect
to P over the set S in $R_n$. The quadratic form

$\int_S [u\,g(\mathbf{x}) + v\,h(\mathbf{x})]^2\,dP = u^2 \int_S g^2\,dP + 2uv \int_S gh\,dP + v^2 \int_S h^2\,dP$

is non-negative for all real values of the variables u and v. Thus
(cf 11.10) the determinant of the form is non-negative, which implies
that we have

(9.5.1) $\Bigl(\int_S gh\,dP\Bigr)^2 \le \int_S g^2\,dP \cdot \int_S h^2\,dP.$
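
A numerical check of (9.5.1) for an assumed distribution with four mass points on the line; the helper `integral`, the sample masses and the two functions are hypothetical choices made for this sketch.

```python
import math

def integral(f, dist):
    """Integral of f with respect to a distribution with finitely many mass points."""
    return sum(f(x) * mass for x, mass in dist.items())

dist = {-2.0: 0.1, 0.5: 0.4, 1.0: 0.3, 3.0: 0.2}   # assumed example, total mass 1

def g(x):
    return math.sin(x)

def h(x):
    return x * x - 1.0

gg = integral(lambda x: g(x) ** 2, dist)
gh = integral(lambda x: g(x) * h(x), dist)
hh = integral(lambda x: h(x) ** 2, dist)

# Determinant of the quadratic form gg*u^2 + 2*gh*u*v + hh*v^2.
determinant = gg * hh - gh ** 2
```

The determinant is non-negative, which is the same statement as (9.5.1); it would vanish only if g and h were proportional on the mass points.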



Chapters 10-12. Various Questions.


CHAPTER 10. 

Fourier Integrals. 

For the applications to probability theory and statistics, we shall require a certain
number of theorems concerning some special classes of Fourier integrals, which will
be deduced in this chapter. The general theory of the subject is treated e.g. in
books by Bochner (Ref. 4), Titchmarsh (Ref. 38) and Wiener (Ref. 41).

10.1. The characteristic function of a distribution in $R_1$. Let
F(x) denote a one-dimensional distribution function (cf 6.6), and t a
real number. The function $g(x) = e^{itx} = \cos tx + i \sin tx$ is then, by
7.4, integrable over $(-\infty, \infty)$ with respect to F(x), since $|e^{itx}| = 1$.
The function of the real variable t

(10.1.1) $\varphi(t) = \int_{-\infty}^{\infty} e^{itx}\,dF(x)$

will be called the characteristic function of the distribution
corresponding to F(x).

In general $\varphi(t)$ is a complex-valued function of t. Obviously we
always have $\varphi(0) = 1$, and for all values of t

$|\varphi(t)| \le \int_{-\infty}^{\infty} dF(x) = 1,$

$\varphi(-t) = \overline{\varphi(t)},$

writing $\bar{a}$ for the conjugate complex quantity of a. It further follows
from 7.3 that $\varphi(t)$ is continuous for all real t.

If the moment $\alpha_k$ of order k of the distribution (cf 7.4) exists, it
follows from 7.3 that we may differentiate (10.1.1) k times with
respect to t, and thus obtain for $0 \le \nu \le k$

(10.1.2) $\varphi^{(\nu)}(t) = i^\nu \int_{-\infty}^{\infty} x^\nu e^{itx}\,dF(x).$

Hence by 7.3 $\varphi^{(\nu)}(t)$ is continuous for all real t, and we have

$\varphi^{(\nu)}(0) = i^\nu \int_{-\infty}^{\infty} x^\nu\,dF(x) = i^\nu \alpha_\nu.$

In the neighbourhood of t = 0 we thus have a development in
MacLaurin's series:

(10.1.3) $\varphi(t) = 1 + \sum_{\nu=1}^{k} \dfrac{\alpha_\nu}{\nu!} (it)^\nu + o(t^k),$

where the error term, divided by $t^k$, tends to zero as $t \to 0$ (cf 12.1).

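Conversely, if it is known that the characteristic function has, for the particular
value t = 0, a finite derivative of even order 2k, this derivative is equal to the limit

$\varphi^{(2k)}(0) = \lim_{t \to 0} \int_{-\infty}^{\infty} \Bigl(\dfrac{e^{itx} - e^{-itx}}{2t}\Bigr)^{2k} dF(x) = (-1)^k \lim_{t \to 0} \int_{-\infty}^{\infty} \Bigl(\dfrac{\sin tx}{t}\Bigr)^{2k} dF(x).$

For any finite interval (a, b) we have, however, by (7.1.7),

$\int_a^b x^{2k}\,dF(x) = \lim_{t \to 0} \int_a^b \Bigl(\dfrac{\sin tx}{t}\Bigr)^{2k} dF(x) \le |\varphi^{(2k)}(0)|.$

It follows that the moment $\alpha_{2k}$ exists, and thus (10.1.2) holds for $0 \le \nu \le 2k$ and
for all values of t.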

We thus see that the differentiability properties of $\varphi(t)$ are related
to the behaviour of F(x) for large values of x, since it is this behaviour
that decides whether the moments $\alpha_\nu$ exist or not. It can also be
shown that, conversely, the behaviour of $\varphi(t)$ at infinity is related to
the continuity and differentiability properties of F(x). Suppose, e.g.,
that F(x) is everywhere continuous, and that a continuous frequency
function $F'(x) = f(x)$ exists for all x, except at most in a finite
number of points. We then have by (7.5.5)

(10.1.4) $\varphi(t) = \int_{-\infty}^{\infty} e^{itx} f(x)\,dx,$

and it can be shown that $\varphi(t)$ tends to zero as $t \to \pm\infty$. If,
moreover, the n:th derivative $f^{(n)}(x)$ exists for all x and is such that
$|f^{(n)}(x)|$ is integrable over $(-\infty, \infty)$, a repeated partial integration
shows that we have

(10.1.5) $|\varphi(t)| < \dfrac{K}{|t|^n}$

for all t, where K is a constant. We shall, however, not give a
detailed proof of these properties here.

Suppose, on the other hand, that F(x) is a step-function with steps
of the height $p_\nu$ in the points $x = x_\nu$. We then have by (7.1.8)

(10.1.6) $\varphi(t) = \sum_\nu p_\nu e^{itx_\nu},$

the series being absolutely and uniformly convergent for all t, since
$\sum_\nu p_\nu = 1$. Each term of the series is a periodic function of t, and
thus certainly does not tend to zero as $t \to \pm\infty$. It can be shown
that also the sum of the series does not tend to zero as $t \to \pm\infty$.
Thus e.g. the characteristic function of the distribution function $\varepsilon(x)$
defined by (6.7.1) is identically equal to 1.
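
For a step-function with finitely many steps, (10.1.6) is a finite sum and the elementary properties of $\varphi(t)$ can be checked numerically. In the Python sketch below the mass points, the masses and the names `char_fn` and `moment` are assumptions made for the illustration.

```python
import cmath
import math

def char_fn(dist, t):
    """Characteristic function (10.1.6) of a distribution with finitely many mass points."""
    return sum(p * cmath.exp(1j * t * x) for x, p in dist.items())

def moment(dist, k):
    """Moment of order k of the same distribution."""
    return sum(p * x ** k for x, p in dist.items())

dist = {-1.0: 0.2, 0.0: 0.5, 2.0: 0.3}   # assumed example, total mass 1

t = 0.7
phi_t = char_fn(dist, t)
phi_minus_t = char_fn(dist, -t)           # equals the conjugate of phi_t

# Central difference quotient at t = 0, approximating the first derivative,
# which by (10.1.2) should be close to i times the first moment.
eps = 1e-4
derivative_at_0 = (char_fn(dist, eps) - char_fn(dist, -eps)) / (2 * eps)

# All mass points are integers here, so phi has the period 2*pi
# and cannot tend to zero as t tends to infinity.
phi_shifted = char_fn(dist, t + 2 * math.pi)
```

A distribution with its whole mass in the point 0, such as `{0.0: 1.0}`, gives `char_fn` identically equal to 1, in agreement with the last remark above.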


Not every function $\varphi(t)$ may be the characteristic function of a distribution.
Necessary conditions are, according to the above, that $\varphi(t)$ should be everywhere
continuous and such that $|\varphi(t)| \le 1$, $\varphi(0) = 1$ and $\varphi(-t) = \overline{\varphi(t)}$. These conditions
are, however, not sufficient. If, e.g., $\varphi(t)$ is near t = 0 of the form $\varphi(t) = 1 + O(t^{2+\delta})$,
where $\delta > 0$, then it follows from (10.1.3) that the distribution corresponding to $\varphi(t)$
must have $\alpha_1 = \alpha_2 = 0$, which means (cf 16.1) that the whole mass of the distribution
is concentrated in the point x = 0. This is, however, the distribution which has the
distribution function $\varepsilon(x)$ and the characteristic function $\varphi(t) = 1$. Hence in this
case $\varphi(t)$ cannot be a characteristic function unless it is identically equal to 1. Thus
e.g. the functions $e^{-t^4}$ and $\dfrac{1}{1 + t^4}$ are no characteristic functions, though both satisfy
the above necessary conditions.

Various necessary and sufficient conditions are known. The simplest seem to be
the following (Cramér, Ref. 71): In order that a given, bounded and continuous function
$\varphi(t)$ should be the characteristic function of a distribution, it is necessary and sufficient
that $\varphi(0) = 1$ and that the function

$\psi(x, A) = \int_0^A \int_0^A \varphi(t - u)\,e^{ix(t - u)}\,dt\,du$

is real and non-negative for all real x and all A > 0.

That these conditions are necessary is easily shown. When $\varphi(t)$ is the characteristic
function corresponding to the distribution function F(x), we find, in fact,

$\psi(x, A) = 2 \int_{-\infty}^{\infty} \dfrac{1 - \cos A(x + y)}{(x + y)^2}\,dF(y),$

and the last expression is evidently real and non-negative. The proof that the
conditions are sufficient depends on the properties of certain integrals analogous to
those used in the two following paragraphs. It is, however, somewhat intricate and
will not be given here.




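10.2. Some auxiliary functions. Consider the functions

$s(h, T) = \frac{2}{\pi} \int_0^T \frac{\sin ht}{t}\, dt$,

$c(h, T) = \frac{2}{\pi} \int_0^T \frac{1 - \cos ht}{t^2}\, dt$,

where $h$ is real and $T > 0$. Obviously $c(h, T) \ge 0$, and

$s(-h, T) = -s(h, T)$,    $c(-h, T) = c(h, T)$.

By simple transformations we obtain for $h > 0$

$s(h, T) = \frac{2}{\pi} \int_0^{hT} \frac{\sin t}{t}\, dt$,

$c(h, T) = \frac{2h}{\pi} \int_0^{hT} \frac{\sin t}{t}\, dt - \frac{2}{\pi} \cdot \frac{1 - \cos hT}{T}$.

Now it is proved in text books on Integral Calculus that the integral

$\int_0^x \frac{\sin t}{t}\, dt$

is bounded for all $x > 0$ and tends to the limit $\frac{\pi}{2}$ as $x \to \infty$.

It follows that $s(h, T)$ is bounded for all real $h$ and all $T > 0$ and that we have, uniformly for $|h| > \delta > 0$,

(10.2.1)    $\lim_{T \to \infty} s(h, T) = \begin{cases} 1 & \text{for } h > 0, \\ 0 & \text{for } h = 0, \\ -1 & \text{for } h < 0. \end{cases}$

We further obtain for all real $h$

(10.2.2)    $\lim_{T \to \infty} c(h, T) = \frac{2}{\pi} \int_0^\infty \frac{1 - \cos ht}{t^2}\, dt = |h|$.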
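The limits (10.2.1) and (10.2.2) can be checked numerically. The sketch below is an illustration only: the Simpson rule, the step count and the cut-off $T = 200$ are assumptions chosen for convenience, and the values for finite $T$ only approximate the limits.

```python
import math

def simpson(f, a, b, n):
    """Composite Simpson rule with n (even) subintervals."""
    step = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * step)
    return total * step / 3

def s(h, T, n=40000):
    # Integrand sin(ht)/t, with its limiting value h at t = 0.
    g = lambda t: h if t == 0.0 else math.sin(h * t) / t
    return 2 / math.pi * simpson(g, 0.0, T, n)

def c(h, T, n=40000):
    # Integrand (1 - cos ht)/t^2 = 2 sin^2(ht/2)/t^2, limiting value h^2/2 at t = 0.
    g = lambda t: h * h / 2 if t == 0.0 else 2 * math.sin(h * t / 2) ** 2 / t ** 2
    return 2 / math.pi * simpson(g, 0.0, T, n)

s_pos, s_zero, s_neg = s(1.5, 200.0), s(0.0, 200.0), s(-1.5, 200.0)
c_pos, c_neg = c(1.5, 200.0), c(-1.5, 200.0)
```

For $T = 200$ the values of `s` lie within about $10^{-2}$ of $+1$, $0$, $-1$, and both values of `c` lie close to $|h| = 1.5$.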




10.3. Uniqueness theorems for characteristic functions in $R_1$. If $(a - h, a + h)$ is a continuity interval (cf. 6.7) of the distribution function $F(x)$, we have

(10.3.1)    $F(a + h) - F(a - h) = \lim_{T \to \infty} \frac{1}{\pi} \int_{-T}^{T} \frac{\sin ht}{t}\, e^{-ita}\, \varphi(t)\, dt$.

This important theorem (Lévy, Ref. 24) shows that a distribution is uniquely determined by its characteristic function. In fact, if two distributions have the same characteristic function, the theorem shows that the two distributions agree for every interval that is a continuity interval for both distributions. Then, by 6.7, the distributions are identical.

In order to prove the theorem, we write

$J = \frac{1}{\pi} \int_{-T}^{T} \frac{\sin ht}{t}\, e^{-ita}\, \varphi(t)\, dt = \frac{1}{\pi} \int_{-T}^{T} \frac{\sin ht}{t}\, e^{-ita}\, dt \int_{-\infty}^{\infty} e^{itx}\, dF(x)$.

Now the modulus of the function $\frac{\sin ht}{t}\, e^{it(x-a)}$ is at most equal to $h$, so that the conditions stated in 7.3 for the reversion of the order of integration are satisfied. Hence

$J = \frac{1}{\pi} \int_{-\infty}^{\infty} dF(x) \int_{-T}^{T} \frac{\sin ht}{t}\, e^{it(x-a)}\, dt = \frac{2}{\pi} \int_{-\infty}^{\infty} dF(x) \int_0^T \frac{\sin ht}{t} \cos (x - a)t\, dt = \int_{-\infty}^{\infty} g(x, T)\, dF(x)$,

where

$g(x, T) = \frac{2}{\pi} \int_0^T \frac{\sin ht}{t} \cos (x - a)t\, dt = \frac{1}{\pi} \int_0^T \frac{\sin (x - a + h)t}{t}\, dt - \frac{1}{\pi} \int_0^T \frac{\sin (x - a - h)t}{t}\, dt = \tfrac{1}{2}\, s(x - a + h, T) - \tfrac{1}{2}\, s(x - a - h, T)$.

Thus by the preceding paragraph $|g(x, T)|$ is less than an absolute constant, and we have

$\lim_{T \to \infty} g(x, T) = \begin{cases} 0 & \text{for } x < a - h, \\ \tfrac{1}{2} & \text{for } x = a - h, \\ 1 & \text{for } a - h < x < a + h, \\ \tfrac{1}{2} & \text{for } x = a + h, \\ 0 & \text{for } x > a + h. \end{cases}$

We may thus apply theorem (7.2.2) and so obtain, since $F(x)$ is continuous for $x = a \pm h$,

$\lim_{T \to \infty} J = \int_{a-h}^{a+h} dF(x) = F(a + h) - F(a - h)$,

so that (10.3.1) is proved.
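As an illustration of (10.3.1), the sketch below recovers the mass of an interval from a characteristic function by numerical integration. It assumes the standard normal distribution, whose characteristic function $e^{-t^2/2}$ is derived in 10.5. The function names, the cut-off $T = 12$ and the midpoint rule are choices made for this example only.

```python
import math

def normal_cf(t):
    """Characteristic function of the standard normal distribution, cf. (10.5.4)."""
    return math.exp(-t * t / 2)

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def interval_mass(cf, a, h, T=12.0, n=6000):
    """Approximate (1/pi) * integral over (-T, T) of sin(ht)/t * exp(-ita) * cf(t) dt.

    For a real, even cf the imaginary part of the integrand is odd and cancels,
    so only sin(ht)/t * cos(ta) * cf(t) is integrated, by the midpoint rule.
    With n even, no midpoint falls on t = 0.
    """
    step = 2 * T / n
    total = 0.0
    for k in range(n):
        t = -T + (k + 0.5) * step
        total += math.sin(h * t) / t * math.cos(t * a) * cf(t)
    return total * step / math.pi

a, h = 0.5, 1.0
from_cf = interval_mass(normal_cf, a, h)
direct = normal_cdf(a + h) - normal_cdf(a - h)
```

Both numbers give the mass of the interval $(-0.5, 1.5)$ under the normal distribution, and they agree to many decimal places.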

In the particular case when $|\varphi(t)|$ is integrable over $(-\infty, \infty)$, it follows from (10.3.1) that we have

$\frac{F(x + h) - F(x - h)}{2h} = \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{\sin ht}{ht}\, e^{-itx}\, \varphi(t)\, dt$

as soon as $F$ is continuous in the points $x \pm h$. When $h$ tends to zero, the function under the integral tends to $e^{-itx} \varphi(t)$, while its modulus is dominated by the integrable function $|\varphi(t)|$. Thus we may apply (7.3.1), and find that the derivative $F'(x) = f(x)$ exists for all $x$, and that we have

(10.3.2)    $f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-itx}\, \varphi(t)\, dt$.

Then $f(x)$ is the frequency function (cf. 6.6) of the distribution, and it follows from 7.3 that $f(x)$ is continuous for all values of $x$. We call attention to the mutual reciprocity between the relations (10.3.2) and (10.1.4).

In order to determine $F(x)$ by means of (10.3.1) we must know $\varphi(t)$ over the whole infinite interval $(-\infty, \infty)$. The knowledge of $\varphi(t)$ over a finite interval is, in fact, not sufficient for a unique determination of $F(x)$. This follows from an example given by Gnedenko (Ref. 117) of two characteristic functions which agree over a finite interval without being identical for all $t$. We shall give a somewhat simpler example due to Khintchine. The two functions

$\varphi_1(t) = \begin{cases} 1 - |t| & \text{for } |t| \le 1, \\ 0 & \text{for } |t| > 1, \end{cases}$

$\varphi_2(t) = \frac{1}{2} + \frac{4}{\pi^2} \left( \frac{\cos \pi t}{1^2} + \frac{\cos 3\pi t}{3^2} + \frac{\cos 5\pi t}{5^2} + \cdots \right)$

are both characteristic functions. $\varphi_1(t)$ is the characteristic function of the distribution defined by the frequency function

$f_1(x) = \frac{1 - \cos x}{\pi x^2}$,

as may be seen by taking $h = 1$, $F(x) = \varepsilon(x)$ and $\varphi(t) = 1$ in (10.3.3), while $\varphi_2(t)$ corresponds to a distribution having the mass $\frac{1}{2}$ placed in the point $x = 0$, and the mass $\frac{2}{n^2 \pi^2}$ in the point $x = n\pi$, where $n = \pm 1, \pm 3, \ldots$. By summation of the trigonometrical series for $\varphi_2(t)$ it is seen that $\varphi_2(t) = \varphi_1(t)$ for $|t| \le 1$. For $|t| > 1$, on the other hand, $\varphi_1(t)$ is equal to zero, while $\varphi_2(t)$ is periodical with the period 2.
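Khintchine's example is easy to verify numerically. The sketch below is an illustration in which the truncation of the series after 2000 terms is an assumption. It evaluates both functions inside and outside the interval $(-1, 1)$.

```python
import math

def phi1(t):
    """1 - |t| for |t| <= 1, and 0 otherwise."""
    return max(0.0, 1.0 - abs(t))

def phi2(t, terms=2000):
    """Partial sum of 1/2 + (4/pi^2) * sum over odd k of cos(k*pi*t)/k^2."""
    series = sum(math.cos(k * math.pi * t) / k ** 2 for k in range(1, 2 * terms, 2))
    return 0.5 + 4 / math.pi ** 2 * series

inside = [-1.0, -0.4, 0.0, 0.3, 1.0]
max_gap_inside = max(abs(phi1(t) - phi2(t)) for t in inside)
outside_values = (phi1(1.5), phi2(1.5))   # phi2 continues periodically, phi1 vanishes
```

Inside the interval the two functions coincide up to the truncation error of the series. At $t = 1.5$ the first is zero, while the second equals $\frac{1}{2}$.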

We now proceed to prove a formula which is closely related to (10.3.1), but differs from it by containing an absolutely convergent integral. In the following paragraph, this formula will find an important application. For any real $a$ and $h > 0$ we have

(10.3.3)    $\int_0^h [F(a + z) - F(a - z)]\, dz = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{1 - \cos ht}{t^2}\, e^{-ita}\, \varphi(t)\, dt$.

Transforming the integral in the second member in the same way as in the proof of (10.3.1), the reversion of the order of integration is justified by means of 7.3. Denoting the second member of (10.3.3) by $J_1$, we then obtain

$J_1 = \frac{1}{\pi} \int_{-\infty}^{\infty} dF(x) \int_{-\infty}^{\infty} \frac{1 - \cos ht}{t^2}\, e^{it(x-a)}\, dt = \frac{2}{\pi} \int_{-\infty}^{\infty} dF(x) \int_0^\infty \frac{1 - \cos ht}{t^2} \cos (x - a)t\, dt$.

In the same way as above it then follows from (10.2.2)

$J_1 = \int_{-\infty}^{\infty} \frac{|x - a + h| + |x - a - h| - 2|x - a|}{2}\, dF(x) = \int_{a-h}^{a+h} (h - |x - a|)\, dF(x)$.

Applying the formula of partial integration (7.5.7) to the last integral, taken over each of the intervals $(a - h, a)$ and $(a, a + h)$ separately, it is finally seen that $J_1$ is identical with the expression in the first member of (10.3.3), so that this relation is proved.



10.4. Continuity theorem for characteristic functions in $R_1$. We have seen in the preceding paragraph that there is a one-to-one correspondence between a distribution and its characteristic function $\varphi(t)$. A distribution function $F(x)$ is thus always uniquely determined by the corresponding characteristic function $\varphi(t)$, and the transformation by which we pass from $F(x)$ to $\varphi(t)$, or conversely, is always unique. We shall now prove a theorem which shows that, subject to certain conditions, this transformation is also continuous, so that the relations $F_n(x) \to F(x)$ and $\varphi_n(t) \to \varphi(t)$ are equivalent.

This theorem is of the highest importance for the applications, 
since it affords a criterion which often permits us to decide whether 
a given sequence of distributions converges to a distribution or not. 
We have seen in 6.7 that a sequence of distributions converges to a distribution when and only when the corresponding sequence of distribution functions converges to a distribution function. In the applications it is, however, sometimes very difficult to investigate directly the convergence of a sequence of distribution functions, while the convergence problem for the corresponding sequence of characteristic functions may be comparatively easy to solve. In such situations, we shall often have occasion to use the following theorem, which is due to Lévy (Ref. 24, 25) and Cramér (Ref. 11).

We are given a sequence of distributions, with the distribution functions $F_1(x), F_2(x), \ldots$, and the characteristic functions $\varphi_1(t), \varphi_2(t), \ldots$. A necessary and sufficient condition for the convergence of the sequence $\{F_n(x)\}$ to a distribution function $F(x)$ is that, for every $t$, the sequence $\{\varphi_n(t)\}$ converges to a limit $\varphi(t)$, which is continuous for the special value $t = 0$.

When this condition is satisfied, the limit $\varphi(t)$ is identical with the characteristic function of the limiting distribution function $F(x)$.

We shall first show that the condition is necessary, and that the limit $\varphi(t)$ is the characteristic function of $F(x)$. This is, in fact, an immediate corollary of (7.5.9), since the conditions of this relation are evidently satisfied if we take $g(x) = e^{itx}$.

The main difficulty lies in the proof that the condition is sufficient. We then assume that $\varphi_n(t)$ tends for every $t$ to a limit $\varphi(t)$ which is continuous for $t = 0$, and we shall prove that under this hypothesis $F_n(x)$ tends to a distribution function $F(x)$. If this is proved, it follows from the first part of the theorem that the limit $\varphi(t)$ is identical with the characteristic function of $F(x)$.

By 6.8 the sequence $\{F_n(x)\}$ contains a sub-sequence $\{F_{n_\nu}(x)\}$ convergent to a non-decreasing function $F(x)$, where $F(x)$ may be determined so as to be everywhere continuous to the right. We shall first prove that $F(x)$ is a distribution function. As we obviously have $0 \le F(x) \le 1$, it is sufficient to prove that $F(+\infty) - F(-\infty) = 1$. From (10.3.3) we obtain, putting $a = 0$,

$\int_0^h F_{n_\nu}(z)\, dz - \int_{-h}^0 F_{n_\nu}(z)\, dz = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{1 - \cos ht}{t^2}\, \varphi_{n_\nu}(t)\, dt$.

On both sides of this relation, we may allow $\nu$ to tend to infinity under the integrals. In fact, the integrals on the left are taken over finite intervals, where $F_{n_\nu}$ is uniformly bounded and tends almost everywhere to $F$, so that we may apply (5.3.6). On the right, the modulus of the function under the integral is dominated by the function $\frac{1 - \cos ht}{t^2}$, which is integrable over $(-\infty, \infty)$, so that we may apply the more general theorem (5.5.2). We thus obtain, dividing by $h$,

$\frac{1}{h} \int_0^h F(z)\, dz - \frac{1}{h} \int_{-h}^0 F(z)\, dz = \frac{1}{\pi h} \int_{-\infty}^{\infty} \frac{1 - \cos ht}{t^2}\, \varphi(t)\, dt = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{1 - \cos t}{t^2}\, \varphi\!\left(\frac{t}{h}\right) dt$.

In this relation, we now allow $h$ to assume a sequence of values tending to infinity. The first member then obviously tends to $F(+\infty) - F(-\infty)$. On the other hand, $\varphi(t)$ is continuous for $t = 0$, so that $\varphi\!\left(\frac{t}{h}\right)$ tends for every $t$ to the limit $\varphi(0)$. We have, however, $\varphi(0) = \lim \varphi_n(0)$, but $\varphi_n(0) = 1$ for every $n$, since $\varphi_n(t)$ is a characteristic function. Hence $\varphi(0) = 1$. Applying once more (5.5.2), we thus obtain from the last integral, using (10.2.2),

$F(+\infty) - F(-\infty) = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{1 - \cos t}{t^2}\, dt = 1$.

Thus we must have $F(+\infty) = 1$, $F(-\infty) = 0$, and the limit $F(x)$ of the sequence $\{F_{n_\nu}(x)\}$ is a distribution function. By the first part of the proof, it then follows that the limit $\varphi(t)$ of the sequence $\{\varphi_{n_\nu}(t)\}$ is identical with the characteristic function of $F(x)$.



Consider now another convergent sub-sequence of $\{F_n(x)\}$, and denote the limit of the new sub-sequence by $F^*(x)$, always assuming this function to be determined so as to be everywhere continuous to the right. In the same way as before, it is then shown that $F^*(x)$ is a distribution function. By hypothesis the characteristic functions of the new sub-sequence have, however, for all values of $t$ the same limit $\varphi(t)$ as before, so that $\varphi(t)$ is the characteristic function of both $F(x)$ and $F^*(x)$. Then according to the uniqueness theorem (10.3.1) we have $F(x) = F^*(x)$ for all $x$.

Thus every convergent sub-sequence of $\{F_n(x)\}$ has the same limit $F(x)$. This is, however, equivalent to the statement that the sequence $\{F_n(x)\}$ converges to $F(x)$, and since we have shown that $F(x)$ is a distribution function, our theorem is proved.


We know from 10.1 that a characteristic function is always continuous for every $t$. Thus it follows from the above theorem that, as soon as the limit $\varphi(t)$ of a sequence of characteristic functions is continuous for the special value $t = 0$, it is continuous for every $t$. The condition that the limit should be continuous for the special value $t = 0$ is, however, essential for the truth of the theorem.

We shall, in fact, show by an example that the theorem is not true, if this condition is omitted. Let $F_n(x)$ be the distribution function defined by

$F_n(x) = \begin{cases} 0 & \text{for } x \le -n, \\ \frac{x + n}{2n} & \text{for } -n < x < n, \\ 1 & \text{for } x \ge n. \end{cases}$

The corresponding frequency function is constantly equal to $\frac{1}{2n}$ in the interval $(-n, n)$, and disappears outside that interval. The corresponding characteristic function is by (10.1.4)

$\varphi_n(t) = \frac{1}{2n} \int_{-n}^{n} e^{itx}\, dx = \frac{\sin nt}{nt}$.

As $n$ tends to infinity, $\varphi_n(t)$ converges for every $t$ to the limit $\varphi(t)$ defined by

$\varphi(t) = \begin{cases} 1 & \text{for } t = 0, \\ 0 & \text{for } t \ne 0. \end{cases}$

Thus the limit $\varphi(t)$ is not continuous for $t = 0$. Accordingly, for every fixed $x$ we have $F_n(x) \to \frac{1}{2}$, so that the limit of $F_n(x)$ is not a distribution function.

In the case $F_n(x) = \varepsilon(x - n)$ considered in 6.7, we have $\varphi_n(t) = e^{int}$, so that the sequence of characteristic functions is never convergent, except when $t$ is a multiple of $2\pi$. Accordingly, for every fixed $x$ we have $F_n(x) \to 0$, so that the limit of $F_n(x)$ is not a distribution function, as we have already seen in 6.7.
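The first counter-example can be reproduced with a few lines of code. This is a sketch for illustration only, in which the value $n = 10^6$ stands in for the passage to the limit.

```python
import math

def F_n(x, n):
    """Distribution function of the uniform distribution on (-n, n)."""
    if x <= -n:
        return 0.0
    if x >= n:
        return 1.0
    return (x + n) / (2 * n)

def phi_n(t, n):
    """Its characteristic function sin(nt)/(nt), equal to 1 for t = 0."""
    return 1.0 if t == 0.0 else math.sin(n * t) / (n * t)

n = 10 ** 6
limit_at_zero = phi_n(0.0, n)        # stays equal to 1
limit_elsewhere = phi_n(0.7, n)      # tends to 0, so the limit jumps at t = 0
cdf_value = F_n(3.0, n)              # tends to 1/2 for every fixed x
```

The pointwise limit of the characteristic functions is discontinuous at $t = 0$, and the distribution functions flatten out to the constant $\frac{1}{2}$, which is not a distribution function.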



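10.5. Some particular integrals. We shall now deduce some formulae that will be used in the sequel. The integral

$\int_{-\infty}^{\infty} e^{-x^2}\, dx = \sqrt{\pi}$

is given in text books on Integral Calculus. Substituting $x \sqrt{h/2}$ for $x$, we obtain for $h > 0$

$\int_{-\infty}^{\infty} e^{-\frac{h}{2} x^2}\, dx = \sqrt{2\pi}\, h^{-\frac{1}{2}}$.

By means of 7.3 it is easily seen that we may differentiate any number of times with respect to $h$, so that

(10.5.1)    $\int_{-\infty}^{\infty} x^{2\nu} e^{-\frac{h}{2} x^2}\, dx = \frac{(2\nu)!}{2^\nu\, \nu!} \sqrt{2\pi}\, h^{-\frac{2\nu + 1}{2}}$    $(\nu = 0, 1, 2, \ldots)$.

Consider now the integral

$\int_{-\infty}^{\infty} e^{itx - \frac{h}{2} x^2}\, dx = \int_{-\infty}^{\infty} \sum_{\nu = 0}^{\infty} \frac{(itx)^\nu}{\nu!}\, e^{-\frac{h}{2} x^2}\, dx$.

The partial sums of the series under the last integral are dominated by the function $e^{|tx| - \frac{h}{2} x^2}$, which is integrable over $(-\infty, \infty)$. Thus by (5.5.2) we may integrate the series term by term and so obtain, since all terms of odd order evidently vanish,

(10.5.2)    $\int_{-\infty}^{\infty} e^{itx - \frac{h}{2} x^2}\, dx = \sum_{\nu = 0}^{\infty} \frac{(it)^{2\nu}}{(2\nu)!} \int_{-\infty}^{\infty} x^{2\nu} e^{-\frac{h}{2} x^2}\, dx = \sqrt{\frac{2\pi}{h}} \sum_{\nu = 0}^{\infty} \frac{1}{\nu!} \left( -\frac{t^2}{2h} \right)^{\nu} = \sqrt{\frac{2\pi}{h}}\, e^{-\frac{t^2}{2h}}$.

Taking here $h = 1$, and introducing the function

(10.5.3)    $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{t^2}{2}}\, dt$,

it follows that we have

(10.5.4)    $\int_{-\infty}^{\infty} e^{itx}\, d\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{itx - \frac{x^2}{2}}\, dx = e^{-\frac{t^2}{2}}$.

Now (10.5.3) shows that $\Phi(x)$ is a non-decreasing and everywhere continuous function, such that $\Phi(-\infty) = 0$ and $\Phi(+\infty) = 1$. Thus $\Phi(x)$ is a distribution function, and then (10.5.4) shows that the corresponding characteristic function is $e^{-\frac{t^2}{2}}$. The distribution determined by $\Phi(x)$ is the important normal distribution, that will be treated in Ch. 17. By repeated partial integration, we obtain from (10.5.4) the relation

(10.5.5)    $\int_{-\infty}^{\infty} e^{itx}\, d\Phi^{(n)}(x) = (-it)^n\, e^{-\frac{t^2}{2}}$.

We shall further consider the integral

(10.5.6)    $\frac{1}{2} \int_{-\infty}^{\infty} e^{itx - |x|}\, dx = \int_0^\infty \cos tx\; e^{-x}\, dx = \left[ \frac{t \sin tx - \cos tx}{1 + t^2}\, e^{-x} \right]_0^\infty = \frac{1}{1 + t^2}$.

This expression may be regarded as the characteristic function corresponding to the frequency function $f(x) = \frac{1}{2} e^{-|x|}$. Since the characteristic function is integrable over $(-\infty, \infty)$, we obtain from (10.3.2) the reciprocal formula

(10.5.7)    $\frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{e^{-itx}}{1 + t^2}\, dt = \frac{1}{2}\, e^{-|x|}$.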
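The closed forms (10.5.1) and (10.5.2) can be compared with direct numerical integration. The sketch below is illustrative. The Simpson rule, the truncation of the range to $(-20, 20)$ and the particular values of $h$ and $t$ are assumptions made for the example.

```python
import math

def simpson(f, a, b, n):
    """Composite Simpson rule with n (even) subintervals."""
    step = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * step)
    return total * step / 3

def even_moment_numeric(nu, h, L=20.0, n=8000):
    """Integral of x^(2 nu) * exp(-h x^2 / 2) over (-L, L)."""
    return simpson(lambda x: x ** (2 * nu) * math.exp(-h * x * x / 2), -L, L, n)

def even_moment_formula(nu, h):
    """Second member of (10.5.1)."""
    return (math.factorial(2 * nu) / (2 ** nu * math.factorial(nu))
            * math.sqrt(2 * math.pi) * h ** (-(2 * nu + 1) / 2))

def gaussian_transform_numeric(t, h, L=20.0, n=8000):
    """Integral of exp(itx - h x^2 / 2); the sine part is odd and vanishes."""
    return simpson(lambda x: math.cos(t * x) * math.exp(-h * x * x / 2), -L, L, n)

h, t = 2.0, 1.3
moment_pairs = [(even_moment_numeric(nu, h), even_moment_formula(nu, h)) for nu in range(4)]
transform_pair = (gaussian_transform_numeric(t, h),
                  math.sqrt(2 * math.pi / h) * math.exp(-t * t / (2 * h)))
```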


10.6. The characteristic function of a distribution in $R_n$. If $\mathbf{t} = (t_1, \ldots, t_n)$ and $\mathbf{x} = (x_1, \ldots, x_n)$ are considered as column vectors (cf. 11.2) corresponding to points in $R_n$, we denote by $\mathbf{t}'\mathbf{x}$ the product formed according to the rule (11.2.1) of vector multiplication:

$\mathbf{t}'\mathbf{x} = t_1 x_1 + \cdots + t_n x_n$.

The definition (10.1.1) of the characteristic function of a one-dimensional distribution is then generalized by writing

(10.6.1)    $\varphi(\mathbf{t}) = \varphi(t_1, \ldots, t_n) = \int_{R_n} e^{i \mathbf{t}'\mathbf{x}}\, dP$,

where $P = P(S)$ is the probability function of a distribution in $R_n$. The characteristic function $\varphi(\mathbf{t})$ of the distribution is thus a function of the $n$ real variables $t_1, \ldots, t_n$. Obviously we always have $\varphi(0, \ldots, 0) = 1$, and for all values of the variables

$|\varphi(\mathbf{t})| \le 1$,    $\varphi(-\mathbf{t}) = \overline{\varphi(\mathbf{t})}$.

Further, $\varphi(\mathbf{t})$ is everywhere continuous. If all moments of the distribution (cf. 9.2) up to a certain order exist, we have in the neighbourhood of the point $\mathbf{t} = 0$ an expansion of $\varphi(\mathbf{t})$ analogous to (10.1.3).

The following theorem, which is a direct generalization of the uniqueness theorem (10.3.1), shows that a distribution in $R_n$ is uniquely determined by its characteristic function.

If the interval $I$ defined by the inequalities $a_\nu - h_\nu < x_\nu < a_\nu + h_\nu$, $(\nu = 1, \ldots, n)$, is a continuity interval (cf. 8.3) of $P(S)$, we have

(10.6.2)    $P(I) = \lim_{T \to \infty} \frac{1}{\pi^n} \int_{-T}^{T} \cdots \int_{-T}^{T} \prod_{\nu = 1}^{n} \frac{\sin h_\nu t_\nu}{t_\nu}\; e^{-i \mathbf{t}'\mathbf{a}}\, \varphi(\mathbf{t})\, dt_1 \ldots dt_n$.

The proof of this theorem is a straightforward generalization of the proof of (10.3.1). In the particular case when $|\varphi(\mathbf{t})|$ is integrable over $R_n$, we find as in (10.3.2) that the frequency function (cf. 8.4) $\frac{\partial^n F}{\partial x_1 \ldots \partial x_n} = f(x_1, \ldots, x_n) = f(\mathbf{x})$ exists and is continuous for all $\mathbf{x}$, and that we have

(10.6.3)    $f(\mathbf{x}) = \frac{1}{(2\pi)^n} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{-i \mathbf{t}'\mathbf{x}}\, \varphi(\mathbf{t})\, dt_1 \ldots dt_n$.

The reciprocal formula corresponding to (10.1.4):

(10.6.4)    $\varphi(\mathbf{t}) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{i \mathbf{t}'\mathbf{x}}\, f(\mathbf{x})\, dx_1 \ldots dx_n$

is obtained from (10.6.1) and holds whenever the frequency function $f(\mathbf{x})$ exists and is continuous, except possibly in certain points belonging to a finite number of hypersurfaces in $R_n$.

We shall also want the following generalization of the theorem (10.3.3), which is proved in the same way as the one-dimensional case.

Let $I_{\mathbf{h}, \mathbf{a}}$ denote the interval defined by the inequalities $a_\nu - h_\nu < x_\nu < a_\nu + h_\nu$, $(\nu = 1, \ldots, n)$. For any real $a_\nu$ and positive $h_\nu$ we have

(10.6.5)    $\int_0^{h_1} \cdots \int_0^{h_n} P(I_{\mathbf{z}, \mathbf{a}})\, dz_1 \ldots dz_n = \frac{1}{\pi^n} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \prod_{\nu = 1}^{n} \frac{1 - \cos h_\nu t_\nu}{t_\nu^2}\; e^{-i \mathbf{t}'\mathbf{a}}\, \varphi(\mathbf{t})\, dt_1 \ldots dt_n$.

10.7. Continuity theorem for characteristic functions in $R_n$. The continuity theorem proved in 10.4 may be directly generalized to multi-dimensional distributions. By 8.5, a sequence of distributions in $R_n$ converges to a distribution when and only when the corresponding distribution functions converge to a distribution function. As in the one-dimensional case, it is often easier in the applications to solve the convergence problem for the corresponding sequence of characteristic functions, and in such situations the following theorem will be useful.

We are given a sequence of distributions in $R_n$, with the distribution functions $F_1(\mathbf{x}), F_2(\mathbf{x}), \ldots$, and the characteristic functions $\varphi_1(\mathbf{t}), \varphi_2(\mathbf{t}), \ldots$. A necessary and sufficient condition for the convergence of the sequence $\{F_n(\mathbf{x})\}$ to a distribution function $F(\mathbf{x})$ is that, for every $\mathbf{t}$, the sequence $\{\varphi_n(\mathbf{t})\}$ converges to a limit $\varphi(\mathbf{t})$, which is continuous at the special point $\mathbf{t} = 0$.

When this condition is satisfied, the limit $\varphi(\mathbf{t})$ is identical with the characteristic function of the limiting distribution function $F(\mathbf{x})$.

The proof that the condition is necessary is quite similar to the corresponding part of the proof in 10.4, and uses the generalization of (7.5.9) to integrals in $R_n$ (cf. 9.4). It then also follows that the limit $\varphi(\mathbf{t})$ is the characteristic function of $F(\mathbf{x})$. In order to prove that the condition is sufficient, we consider a sub-sequence $\{F_{m_\mu}(\mathbf{x})\}$ which converges (cf. 8.5) to a limit $F(\mathbf{x}) = F(x_1, \ldots, x_n)$ that is non-decreasing and continuous to the right in each variable $x_\nu$. We want to show that $F(\mathbf{x})$ is a distribution function, i.e. that the corresponding non-negative and additive set function $P(S)$ is a probability function. For this purpose, it is sufficient to show that we have $P(R_n) = 1$. We then apply (10.6.5) to each $\varphi_{m_\mu}(\mathbf{t})$, putting all the $a_\nu = 0$. When $\mu$ tends to infinity, we obtain by the same argument as in 10.4

$\frac{1}{h_1 \ldots h_n} \int_0^{h_1} \cdots \int_0^{h_n} P(I_{\mathbf{z}, 0})\, dz_1 \ldots dz_n = \frac{1}{\pi^n} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \prod_{\nu = 1}^{n} \frac{1 - \cos t_\nu}{t_\nu^2}\; \varphi\!\left( \frac{t_1}{h_1}, \ldots, \frac{t_n}{h_n} \right) dt_1 \ldots dt_n$.

Allowing the $h_\nu$ to tend to infinity, we then obtain, in perfect analogy with the one-dimensional case,

$P(R_n) = \frac{1}{\pi^n} \prod_{\nu = 1}^{n} \int_{-\infty}^{\infty} \frac{1 - \cos t_\nu}{t_\nu^2}\, dt_\nu = 1$,

so that the limit $P(S)$ of the sequence $\{P_{m_\mu}(S)\}$ is a probability function. The proof is then completed in the same way as in 10.4.

CHAPTER 11. 

Matrices, Determinants and Quadratic Forms. 

The subject of the present chapter is treated in several text-books in an elementary form well adapted for our purpose. We refer particularly to Aitken (Ref. 1), Bôcher (Ref. S), and for Scandinavian readers to Bohr-Mollerup (Ref. 6). We shall here restrict ourselves to give, for the convenience of the reader, a brief survey (in many cases without complete proofs) of some fundamental definitions and properties that will be used in the sequel, adding full proofs of certain special theorems not contained in the text-books.

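11.1. Matrices. A matrix $\mathbf{A}$ of order $m \cdot n$ is a rectangular scheme of numbers or elements $a_{ik}$ arranged in $m$ rows and $n$ columns:

$\mathbf{A} = \begin{Bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{Bmatrix}$.

We write briefly $\mathbf{A} = \{a_{ik}\}$, and when we want to emphasize the order of the matrix, we write $\mathbf{A}_{mn}$ instead of $\mathbf{A}$. We shall always assume that the elements $a_{ik}$ are real numbers.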





In the particular case when $m = n = 1$, the matrix $\mathbf{A}$ consists of one single element $a_{11}$, and we shall then identify the matrix with the ordinary number $a_{11}$.

Two matrices $\mathbf{A}$ and $\mathbf{B}$ are called equal, and we write $\mathbf{A} = \mathbf{B}$, when and only when $\mathbf{A}$ and $\mathbf{B}$ are of the same order, and all corresponding elements are equal: $a_{ik} = b_{ik}$ for all $i$ and $k$. We shall now define three kinds of operations with matrices:

1. The product of a matrix $\mathbf{A}$ and an ordinary number $c$ is defined as the matrix obtained by multiplying every element of $\mathbf{A}$ by $c$. Thus $c\mathbf{A} = \mathbf{A}c = \mathbf{B}$, where the elements of $\mathbf{B}$ are $b_{ik} = c\, a_{ik}$. When $c = -1$, we write $-\mathbf{A}$ instead of $(-1)\mathbf{A}$.

2. The sum of two matrices $\mathbf{A}$ and $\mathbf{B}$ is only defined when the two matrices are of the same order. Then the sum $\mathbf{C} = \mathbf{A} + \mathbf{B}$ is defined as a matrix of the same order with the elements $c_{ik} = a_{ik} + b_{ik}$.

3. The product of two matrices $\mathbf{A}$ and $\mathbf{B}$ is only defined when the first factor $\mathbf{A}$ is of order $m \cdot r$, and the second factor $\mathbf{B}$ is of order $r \cdot n$, so that the number of columns of the first factor agrees with the number of rows of the second factor. Then the product $\mathbf{C} = \mathbf{A}\mathbf{B}$, or $\mathbf{C}_{mn} = \mathbf{A}_{mr} \mathbf{B}_{rn}$, is defined as a matrix of order $m \cdot n$ with elements $c_{ik}$ given by the expression

$c_{ik} = \sum_{j=1}^{r} a_{ij} b_{jk}$.

The element in the $i$:th row and $k$:th column of the product matrix is thus the sum of all products of corresponding elements from the $i$:th row of the first factor and the $k$:th column of the second factor.

The three matrix operations thus defined are associative and distributive. Moreover, the two first operations are commutative, while generally the third is non-commutative. Thus we have, e.g.,

$(\mathbf{A} + \mathbf{B}) + \mathbf{C} = \mathbf{A} + (\mathbf{B} + \mathbf{C})$,    $(\mathbf{A}\mathbf{B})\mathbf{C} = \mathbf{A}(\mathbf{B}\mathbf{C})$,

$\mathbf{C}(\mathbf{A} + \mathbf{B}) = \mathbf{C}\mathbf{A} + \mathbf{C}\mathbf{B}$,    $(\mathbf{A} + \mathbf{B})\mathbf{C} = \mathbf{A}\mathbf{C} + \mathbf{B}\mathbf{C}$,

$\mathbf{A} + \mathbf{B} = \mathbf{B} + \mathbf{A}$,    $c(\mathbf{A} + \mathbf{B}) = c\mathbf{A} + c\mathbf{B}$,

but generally not $\mathbf{A}\mathbf{B} = \mathbf{B}\mathbf{A}$. Even if both products $\mathbf{A}\mathbf{B}$ and $\mathbf{B}\mathbf{A}$ are defined, they may be unequal. We are thus obliged to distinguish between premultiplication and postmultiplication. $\mathbf{A}\mathbf{B}$ means $\mathbf{A}$ postmultiplied by $\mathbf{B}$, or $\mathbf{B}$ premultiplied by $\mathbf{A}$.
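The rules above can be tried out on small integer matrices. The sketch below stores a matrix as a list of rows. The helper names `mat_mul` and `mat_add` and the example matrices are assumptions made for illustration.

```python
def mat_mul(A, B):
    """Product C = AB with c_ik = sum over j of a_ij * b_jk; A is m x r, B is r x n."""
    r = len(B)
    if any(len(row) != r for row in A):
        raise ValueError("columns of the first factor must equal rows of the second")
    return [[sum(A[i][j] * B[j][k] for j in range(r)) for k in range(len(B[0]))]
            for i in range(len(A))]

def mat_add(A, B):
    """Sum of two matrices of the same order."""
    return [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [1, 3]]

AB = mat_mul(A, B)   # postmultiplying by B interchanges the columns of A
BA = mat_mul(B, A)   # premultiplying by B interchanges the rows of A
```

Here `AB` and `BA` are both defined but unequal, while the associative and distributive rules hold.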



From these properties, it follows e.g. that a linear combination $c_1 \mathbf{A}_1 + \cdots + c_p \mathbf{A}_p$ is uniquely defined as soon as all the $\mathbf{A}_i$ are of the same order, and the terms may be arbitrarily rearranged. Similarly, the product $\mathbf{D}_{mn} = \mathbf{A}_{mr} \mathbf{B}_{rs} \mathbf{C}_{sn}$ is uniquely defined, but here no rearrangement of the factors is allowed. The elements $d_{hk}$ of $\mathbf{D}$ are given by the expression

$d_{hk} = \sum_{i=1}^{r} \sum_{j=1}^{s} a_{hi}\, b_{ij}\, c_{jk}$.

The transpose of a matrix $\mathbf{A} = \{a_{ik}\}$ of order $m \cdot n$ is a matrix $\mathbf{A}' = \{a'_{ik}\}$ of order $n \cdot m$, such that $a'_{ik} = a_{ki}$. Thus the rows of $\mathbf{A}'$ are the columns of $\mathbf{A}$, while the columns of $\mathbf{A}'$ are the rows of $\mathbf{A}$. Obviously we have

$(\mathbf{A}')' = \mathbf{A}$,    $(\mathbf{A} + \mathbf{B})' = \mathbf{A}' + \mathbf{B}'$,    $(\mathbf{A}\mathbf{B})' = \mathbf{B}'\mathbf{A}'$.

Any matrix obtained by deleting one or more of the rows and columns of $\mathbf{A}$ is called a submatrix of $\mathbf{A}$. In particular every element of $\mathbf{A}$ is a submatrix of order $1 \cdot 1$, while the rows and columns are submatrices of order $1 \cdot n$ and $m \cdot 1$ respectively.

When $m = n$, we shall call $\mathbf{A}$ a square matrix. Owing to the associative property of matrix multiplication, the powers $\mathbf{A}^2, \mathbf{A}^3, \ldots$ of a square matrix are defined without ambiguity. The elements $a_{11}, a_{22}, \ldots, a_{nn}$ of a square matrix form the main or principal diagonal of the matrix, and are called the diagonal elements.

A square matrix which is symmetrical about its main diagonal is called a symmetric matrix. A symmetric matrix is identical with its transpose, so that we have $\mathbf{A}' = \mathbf{A}$ or $a_{ki} = a_{ik}$. For an arbitrary matrix $\mathbf{A} = \mathbf{A}_{mn}$, it will be seen that the products $\mathbf{A}\mathbf{A}'$ and $\mathbf{A}'\mathbf{A}$ are symmetric, and of order $m \cdot m$ and $n \cdot n$ respectively.

A symmetric matrix with all its non-diagonal elements equal to zero is called a diagonal matrix. If $\mathbf{A}_{mn}$ is an arbitrary matrix, and if $\mathbf{D}_{mm}$ and $\mathbf{D}_{nn}$ are diagonal matrices, the product $\mathbf{D}_{mm} \mathbf{A}_{mn}$ is obtained by multiplying the rows of $\mathbf{A}$ by the corresponding diagonal elements of $\mathbf{D}$, while the product $\mathbf{A}_{mn} \mathbf{D}_{nn}$ is obtained by multiplying the columns of $\mathbf{A}$ by the corresponding diagonal elements of $\mathbf{D}$.

A unit matrix $\mathbf{I}$ is a diagonal matrix with all its diagonal elements equal to 1. For any matrix $\mathbf{A} = \mathbf{A}_{mn}$ we have

$\mathbf{I}\mathbf{A} = \mathbf{A}\mathbf{I} = \mathbf{A}$,

where $\mathbf{I}$ denotes the unit matrix of order $m \cdot m$ in the first product, and of order $n \cdot n$ in the second.

A matrix (not necessarily square) having all its elements equal to zero is called a zero matrix, and is denoted by $\mathbf{0}$.

11.2. Vectors. A vector is a matrix consisting of one single row or one single column, and is called a row vector or a column vector, as the case may be. Thus a row vector $\mathbf{x} = \{x_1, \ldots, x_n\}$ is a matrix of order $1 \cdot n$, while a column vector

$\mathbf{x} = \begin{Bmatrix} x_1 \\ \vdots \\ x_n \end{Bmatrix}$

is of order $n \cdot 1$. In order to simplify the writing we shall, however, usually write the latter vector in the form $\mathbf{x} = (x_1, \ldots, x_n)$, indicating by the use of ordinary instead of curled brackets that the vector is to be conceived as a column vector. The majority of vectors occurring in the applications will be of this kind.

The transpose of the column vector $\mathbf{x} = (x_1, \ldots, x_n)$ is the row vector $\mathbf{x}' = \{x_1, \ldots, x_n\}$, and conversely.

If $\mathbf{x} = (x_1, \ldots, x_n)$ and $\mathbf{y} = (y_1, \ldots, y_n)$ are two column vectors, the product $\mathbf{x}'\mathbf{y}$ is a matrix of order $1 \cdot 1$, i.e. an ordinary number:

(11.2.1)    $\mathbf{x}'\mathbf{y} = x_1 y_1 + \cdots + x_n y_n$.

In particular for $\mathbf{x} = \mathbf{y}$ we have

$\mathbf{x}'\mathbf{x} = x_1^2 + \cdots + x_n^2$.

The products $\mathbf{x}\mathbf{y}'$ and $\mathbf{x}\mathbf{x}'$, on the other hand, are not ordinary numbers, but matrices of order $n \cdot n$.

The vectors $\mathbf{x}_1, \ldots, \mathbf{x}_p$ are said to be linearly dependent, if a relation of the form $c_1 \mathbf{x}_1 + \cdots + c_p \mathbf{x}_p = \mathbf{0}$ exists, where the $c_i$ are ordinary numbers which are not all equal to zero. Otherwise $\mathbf{x}_1, \ldots, \mathbf{x}_p$ are linearly independent. Similarly, $p$ functions $f_1, \ldots, f_p$ of one or more variables are said to be linearly dependent, if a relation $c_1 f_1 + \cdots + c_p f_p = 0$, where the $c_i$ are constants not all $= 0$, holds for all values of the variables. When several linear relations of this form exist, these are called independent, if the corresponding vectors $\mathbf{c} = (c_1, \ldots, c_p)$ are linearly independent.




11.3. Matrix notation for linear transformations. A linear transformation

(11.3.1)    $\begin{aligned} x_1 &= a_{11} y_1 + a_{12} y_2 + \cdots + a_{1n} y_n, \\ x_2 &= a_{21} y_1 + a_{22} y_2 + \cdots + a_{2n} y_n, \\ &\;\;\vdots \\ x_m &= a_{m1} y_1 + a_{m2} y_2 + \cdots + a_{mn} y_n, \end{aligned}$

establishes a relation between two sets of variables, $x_1, \ldots, x_m$ and $y_1, \ldots, y_n$, where $m$ is not necessarily equal to $n$. The matrix $\mathbf{A} = \mathbf{A}_{mn} = \{a_{ik}\}$ is the transformation matrix.

Now if $\mathbf{x} = (x_1, \ldots, x_m)$ and $\mathbf{y} = (y_1, \ldots, y_n)$ are conceived as column vectors, the right-hand sides of the equations (11.3.1) are the elements of the product matrix $\mathbf{A}\mathbf{y}$, which is of order $m \cdot 1$, i.e. a column vector. Thus (11.3.1) expresses that the corresponding elements of the column vectors $\mathbf{x}$ and $\mathbf{A}\mathbf{y}$ are equal, so that in matrix notation the transformation (11.3.1) takes the simple form $\mathbf{x} = \mathbf{A}\mathbf{y}$.


11.4. Matrix notation for bilinear and quadratic forms. In the column vectors $\mathbf{x}$ and $\mathbf{y}$ of the preceding paragraph, we now consider the $x_i$ and $y_k$ as two sets of independent variables, and form the product matrix $\mathbf{x}'\mathbf{A}\mathbf{y}$, where $\mathbf{A} = \mathbf{A}_{mn} = \{a_{ik}\}$. This is a matrix of order $1 \cdot 1$, i.e. an ordinary number, and we find

(11.4.1)    $\mathbf{x}'\mathbf{A}\mathbf{y} = \sum_{i, k} a_{ik}\, x_i\, y_k$,

where $i = 1, 2, \ldots, m$ and $k = 1, 2, \ldots, n$. Thus the bilinear form in the variables $x_i$ and $y_k$ that appears here in the second member has a simple expression in matrix notation.

In the important particular case when $m = n$, $\mathbf{x} = \mathbf{y}$, and $\mathbf{A}$ is symmetric, the bilinear form (11.4.1) becomes

(11.4.2)    $\mathbf{x}'\mathbf{A}\mathbf{x} = \sum_{i, k = 1}^{n} a_{ik}\, x_i\, x_k$,

where $a_{ki} = a_{ik}$. This expression is called a quadratic form in the variables $x_1, \ldots, x_n$, and will often be denoted by $Q(\mathbf{x})$ or $Q(x_1, \ldots, x_n)$. In matrix notation, we thus have $Q(\mathbf{x}) = \mathbf{x}'\mathbf{A}\mathbf{x}$. The symmetric matrix $\mathbf{A}$ is called the matrix of the form $Q$. If, in particular, $\mathbf{A} = \mathbf{I}$, we have $Q = \mathbf{x}'\mathbf{I}\mathbf{x} = \mathbf{x}'\mathbf{x} = x_1^2 + \cdots + x_n^2$.

The matrix expressions (11.4.1) and (11.4.2) are particularly well adapted for the study of linear transformations of bilinear and quadratic forms. Thus if, in the quadratic form $Q(x_1, \ldots, x_n) = \sum_{i, k = 1}^{n} a_{ik}\, x_i\, x_k$, new variables $y_1, \ldots, y_m$ are introduced by the linear transformation $\mathbf{x} = \mathbf{C}\mathbf{y}$, where $\mathbf{C} = \mathbf{C}_{nm}$, the result is a quadratic form $Q_1(y_1, \ldots, y_m)$ in the new variables:

$Q(x_1, \ldots, x_n) = Q_1(y_1, \ldots, y_m) = \sum_{i, k = 1}^{m} b_{ik}\, y_i\, y_k$,

and the matrix expression (11.4.2) then immediately gives

$Q = \mathbf{x}'\mathbf{A}\mathbf{x} = \mathbf{y}'\mathbf{C}'\mathbf{A}\mathbf{C}\mathbf{y} = \mathbf{y}'\mathbf{B}\mathbf{y}$,

where $\mathbf{B} = \mathbf{C}'\mathbf{A}\mathbf{C}$. By transposition it is seen that this is a symmetric matrix, and thus the matrix of the transformed form is $\mathbf{C}'\mathbf{A}\mathbf{C}$. The order is, of course, $m \cdot m$.
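A small worked instance of the rule $\mathbf{B} = \mathbf{C}'\mathbf{A}\mathbf{C}$ may make the change of variables concrete. The matrices and helper names below are assumptions chosen for the example, with $n = 3$ and $m = 2$.

```python
def mat_mul(A, B):
    """Matrix product with c_ik = sum over j of a_ij * b_jk."""
    return [[sum(A[i][j] * B[j][k] for j in range(len(B))) for k in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def quad_form(A, x):
    """Q(x) = x'Ax = sum over i, k of a_ik * x_i * x_k."""
    n = len(x)
    return sum(A[i][k] * x[i] * x[k] for i in range(n) for k in range(n))

A = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]        # symmetric, order 3 x 3
C = [[1, 2], [0, 1], [1, -1]]                # transformation matrix, order 3 x 2
B = mat_mul(mat_mul(transpose(C), A), C)     # B = C'AC, order 2 x 2

y = [3, -2]
x = [sum(C[i][j] * y[j] for j in range(len(y))) for i in range(len(C))]   # x = Cy

q_old = quad_form(A, x)   # Q evaluated in the old variables
q_new = quad_form(B, y)   # Q_1 evaluated in the new variables
```

Both evaluations give the same number, and `B` comes out symmetric, as the transposition argument in the text requires.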

11.5. Determinants. To every square matrix $\mathbf{A} = \mathbf{A}_{nn} = \{a_{ik}\}$ corresponds a number $A$ known as the determinant of the matrix, which is denoted

$A = |\mathbf{A}| = |a_{ik}| = \begin{vmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{vmatrix}$.

The determinant is defined as the sum

$A = \sum \pm\, a_{1 r_1} a_{2 r_2} \cdots a_{n r_n}$,

where the second subscripts $r_1, \ldots, r_n$ run through all the $n!$ possible permutations of the numbers $1, 2, \ldots, n$, while the sign of each term is $+$ or $-$ according as the corresponding permutation is even or odd. The number $n$ is called the order of the determinant.

The determinants of a square matrix $\mathbf{A}$ and of its transpose $\mathbf{A}'$ are equal: $A = A'$. If two rows or two columns in $\mathbf{A}$ are interchanged, the determinant changes its sign. Hence if two rows or two columns in $\mathbf{A}$ are identical, the determinant is zero. If $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$ are square matrices such that $\mathbf{A}\mathbf{B} = \mathbf{C}$, the corresponding determinants satisfy the relation $AB = C$.
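The definition of the determinant as a signed sum over permutations can be coded directly. The sketch below is for illustration only, since the sum has $n!$ terms and is practical only for small $n$. The example matrices are assumptions.

```python
from itertools import permutations

def sign(perm):
    """+1 for an even permutation, -1 for an odd one, found by counting inversions."""
    inversions = sum(1 for i in range(len(perm)) for j in range(i + 1, len(perm))
                     if perm[i] > perm[j])
    return -1 if inversions % 2 else 1

def det(A):
    """Sum over all n! permutations r of (+/-) a_{1 r_1} a_{2 r_2} ... a_{n r_n}."""
    total = 0
    for perm in permutations(range(len(A))):
        term = sign(perm)
        for row, col in enumerate(perm):
            term *= A[row][col]
        total += term
    return total

def mat_mul(A, B):
    return [[sum(A[i][j] * B[j][k] for j in range(len(B))) for k in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

A = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]
B = [[1, 2, 0], [0, 1, 3], [4, 0, 1]]
rows_swapped = [A[1], A[0], A[2]]
```

With these matrices the properties stated in the text can be checked one by one: equality of the determinants of $\mathbf{A}$ and $\mathbf{A}'$, the change of sign on interchanging two rows, and the product rule.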




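When $\mathbf{A}$ is an arbitrary matrix (not necessarily square), the determinant of any square submatrix of $\mathbf{A}$ is called a minor of $\mathbf{A}$. When $\mathbf{A}$ is square, a principal minor is a minor, the diagonal elements of which are diagonal elements of $\mathbf{A}$.

In a square matrix $\mathbf{A} = \{a_{ik}\}$, the cofactor $A_{ik}$ of the element $a_{ik}$ is the particular minor obtained by deleting the $i$:th row and the $k$:th column, multiplied with $(-1)^{i+k}$. We have the important identities

(11.5.1)    $\sum_{j=1}^{n} a_{ij} A_{kj} = \begin{cases} A & \text{for } i = k, \\ 0 & \text{for } i \ne k, \end{cases}$

(11.5.2)    $\sum_{j=1}^{n} a_{ji} A_{jk} = \begin{cases} A & \text{for } i = k, \\ 0 & \text{for } i \ne k, \end{cases}$

and further

(11.5.3)    $A = a_{11} A_{11} - \sum_{i, k = 2}^{n} a_{i1}\, a_{1k}\, A_{11 \cdot ik}$,

where $A_{11 \cdot ik}$ is the cofactor of $a_{ik}$ in $A_{11}$.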


11.6. Rank. 一 The rank of a matrix A (not necessarily square) is 
the greatest integer r such that A contains at least one minor of 
order r which is not equal to zero. If all minors of A are zero, A is 
a zero matrix, and we put r == 0. When A = A m n , the rank r is at 
most equal to the smaller of the numbers m and n. 

Let the rows and columns of A be considered as vectors. If A. is 
of rank r, it is possible to find r linearly independent rows of A t 
while any r + 1 rows are linearly dependent. The same holds true 
for columns. 

If A u A t1 . A p are of ranks r 1} r iy . the rank of the sum 

A x + + A p is at most equal to the b-um r x + + r p , while the 

rank of the product ... A p is at most equal to the smallest of 
the ranks r lf . . r p . 

If a square matrix A = A„ n is such that A 〆0， then A is of 
rank n. Siich a matrix is said to be non-singtilar, while a square 
matrix with A = 0 is of rank r<n and is called a singular matrix. 
If an arbitrary matrix B is multiplied (pre- or post ) by a non-singular 
matrix A y the product has the same rank as B. When the matrix of 
a linear transformation is singular or non-singular, the corresponding 
adjectives are also applied to the transformation. 

109 





If 乂 is sjmmetric and of rank r, there is at least one principal 
minor of order r in A which is not zero. Hence in particular the 
rank of a diagonal matrix is equal to the number of diagonal elements 
which are different from zero. 

n 

The rank of a quadratic form Q = xA x == ^ x t Xk is, by defini- 

*, *=i 

tion, equal to the rank of the matrix A of the form. According as 
^ is singular or non-singular, the same expressions are used with respect 
to Q. A non-singular linear transformation does not affect the rank 
of the form. If, by such a transformation, Q ia changed into 

r 

2 %l ^ *» w ^ ere A 〆 0 for « = 1, 2, . . ., r, it follows that Q is of rank r. 

l 

The rank is the smallest number of independent variables, on which 
Q may be brought by a non-singular linear transformation. 

A proposition which is often useful is the following: If $Q$ may be
written in the form $Q = L_1^2 + \cdots + L_p^2$, where the $L_i$ are linear
functions of $x_1, \ldots, x_n$, and if there are exactly $h$ independent linear
relations (cf 11.2) between the $L_i$, then the rank of $Q$ is $p - h$. It
follows that, if we know that there are at least $h$ such linear relations,
the rank of $Q$ is $\leq p - h$.
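The proposition can be checked on a small case. The sketch below is only an illustration, not part of the original text: it takes the linear forms $L_i = x_i - \bar{x}$ (so $p = n$, with the single relation $L_1 + \cdots + L_n = 0$, i.e. $h = 1$) and computes the rank of the matrix of $Q = \sum L_i^2$ by exact Gaussian elimination. The helper name `rank` and the choice $n = 4$ are assumptions made for the example.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix (list of rows) by exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

n = 4
# Coefficient rows of L_i = x_i - mean(x); they satisfy L_1 + ... + L_n = 0.
L = [[Fraction(int(i == k)) - Fraction(1, n) for k in range(n)] for i in range(n)]
# The matrix of Q = L_1^2 + ... + L_n^2 is L'L.
A = [[sum(L[j][i] * L[j][k] for j in range(n)) for k in range(n)] for i in range(n)]
rank_Q = rank(A)   # the proposition predicts p - h = n - 1
```

Exact rational arithmetic is used so that the rank is not disturbed by rounding.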

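11.7. Adjugate and reciprocal matrices. Let $\mathbf{A} = \{a_{ik}\}$ be a
square matrix, and let as before $A_{ik}$ denote the cofactor of the element
$a_{ik}$. If we form a matrix $\{A_{ik}\}$ with the cofactors as elements, and
then transpose, we obtain a new matrix $\mathbf{A}^* = \{a^*_{ik}\}$, where $a^*_{ik} = A_{ki}$.
We shall call $\mathbf{A}^*$ the adjugate of $\mathbf{A}$. By the identities (11.5.1) and
(11.5.2) we find

(11.7.1)    $\mathbf{A}\mathbf{A}^* = \mathbf{A}^*\mathbf{A} = A\,\mathbf{I} = \begin{pmatrix} A & 0 & \cdots & 0 \\ 0 & A & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & A \end{pmatrix}.$

For the cofactor $A^*_{ik}$ of the element $a^*_{ik} = A_{ki}$ in $\mathbf{A}^*$ we have

(11.7.2)    $A^*_{ik} = A^{n-2} a_{ki}.$

This is only a particular case of a general relation which expresses
any minor of $\mathbf{A}^*$ in terms of $A$ and its minors. We shall here only
quote the further particular case

(11.7.3)    $\begin{vmatrix} A_{ii} & A_{ik} \\ A_{ki} & A_{kk} \end{vmatrix} = A_{ii} A_{kk} - A_{ik} A_{ki} = A\, A_{ii.kk}.$

When $\mathbf{A}$ is non-singular, the matrix $\mathbf{A}^{-1} = \frac{1}{A}\mathbf{A}^* = \{A_{ki}/A\}$ is called
the reciprocal of $\mathbf{A}$. We obtain from (11.7.1)

(11.7.4)    $\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}.$

The matrix equations $\mathbf{A}\mathbf{X} = \mathbf{I}$ and $\mathbf{X}\mathbf{A} = \mathbf{I}$ then both have a unique
solution, viz. $\mathbf{X} = \mathbf{A}^{-1}$. It follows that the determinant of $\mathbf{A}^{-1}$ is $A^{-1}$.
Further $(\mathbf{A}^{-1})^{-1} = \mathbf{A}$, so that the relation of reciprocity is mutual.

The transpose of a reciprocal is equal to the reciprocal of the
transpose: $(\mathbf{A}^{-1})' = (\mathbf{A}')^{-1}$. For the reciprocal of a product we have the
rule $(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$.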
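As an illustration of (11.7.1) and (11.7.4), the following sketch builds the adjugate of a small matrix from its cofactors and forms the reciprocal from it. The example matrix and the helper names (`minor`, `det`, `adjugate`, `matmul`) are assumptions chosen for the illustration; cofactor expansion is used only because it mirrors the definitions and is not an efficient method for large matrices.

```python
from fractions import Fraction

def minor(m, i, k):
    """Matrix left after deleting row i and column k."""
    return [row[:k] + row[k + 1:] for r, row in enumerate(m) if r != i]

def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** k * m[0][k] * det(minor(m, 0, k)) for k in range(len(m)))

def adjugate(m):
    n = len(m)
    # Element (i, k) of the adjugate is the cofactor A_ki (note the transposition).
    return [[(-1) ** (i + k) * det(minor(m, k, i)) for k in range(n)] for i in range(n)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

A = [[Fraction(v) for v in row] for row in [[2, 1, 0], [1, 3, 1], [0, 1, 4]]]
d = det(A)                                  # the determinant A
adj = adjugate(A)                           # the adjugate A*
inverse = [[v / d for v in row] for row in adj]
product = matmul(A, adj)                    # should equal d times the unit matrix
identity = matmul(A, inverse)               # should equal the unit matrix
```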

When $\mathbf{A}$ is symmetric, we have $A_{ki} = A_{ik}$, so that the adjugate $\mathbf{A}^*$
and the reciprocal $\mathbf{A}^{-1}$ are also symmetric. The reciprocal of a
diagonal matrix $\mathbf{D}$ with the diagonal elements $d_1, \ldots, d_n$ is another
diagonal matrix $\mathbf{D}^{-1}$ with the diagonal elements $d_1^{-1}, \ldots, d_n^{-1}$.

If $Q = \mathbf{x}'\mathbf{A}\mathbf{x}$ is a non-singular quadratic form, the form
$Q^{-1} = \mathbf{x}'\mathbf{A}^{-1}\mathbf{x}$ is called the reciprocal form of $Q$. Obviously $(Q^{-1})^{-1} = Q$.

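Let $\mathbf{x} = (x_1, \ldots, x_n)$ and $\mathbf{t} = (t_1, \ldots, t_n)$ be variable column vectors.
If new variables $\mathbf{y} = (y_1, \ldots, y_m)$ and $\mathbf{u} = (u_1, \ldots, u_m)$ are introduced
by the transformations

(11.7.5)    $\mathbf{y} = \mathbf{C}\mathbf{x}, \qquad \mathbf{t} = \mathbf{C}'\mathbf{u},$

where $\mathbf{C} = \mathbf{C}_{mn}$, we have

(11.7.6)    $\mathbf{t}'\mathbf{x} = \mathbf{u}'\mathbf{C}\mathbf{x} = \mathbf{u}'\mathbf{y}.$

The bilinear form $\mathbf{t}'\mathbf{x} = t_1 x_1 + \cdots + t_n x_n$ is thus transformed into
the analogous form $\mathbf{u}'\mathbf{y} = u_1 y_1 + \cdots + u_m y_m$ in the new variables. Two
sets of variables $x_i$ and $t_i$ which are transformed according to (11.7.5)
are called contragredient sets of variables. In the particular case when
$m = n$ and $\mathbf{C}$ is non-singular, (11.7.5) may be written

(11.7.7)    $\mathbf{y} = \mathbf{C}\mathbf{x}, \qquad \mathbf{u} = (\mathbf{C}')^{-1}\mathbf{t}.$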

11.8. Linear equations. We shall here only consider some
particular cases. The non-homogeneous system

(11.8.1)
    $a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n = h_1,$
    $\cdots\cdots\cdots$
    $a_{n1} x_1 + a_{n2} x_2 + \cdots + a_{nn} x_n = h_n,$

is equivalent to the matrix relation $\mathbf{A}\mathbf{x} = \mathbf{h}$, where $\mathbf{A} = \{a_{ik}\}$,
$\mathbf{x} = (x_1, \ldots, x_n)$ and $\mathbf{h} = (h_1, \ldots, h_n)$. If $\mathbf{A}$ is non-singular, we may
premultiply both sides by the reciprocal matrix $\mathbf{A}^{-1}$, and so obtain
the unique solution $\mathbf{x} = \mathbf{A}^{-1}\mathbf{h}$, or in explicit form

(11.8.2)    $x_k = \frac{1}{A} \sum_{i=1}^{n} h_i A_{ik} \qquad (k = 1, 2, \ldots, n).$

Thus $x_k$ is expressed by a fraction with the denominator $A$ and the
numerator equal to the determinant obtained from $A$ when the elements
of the $k$-th column are replaced by the second members $h_1, \ldots, h_n$.
This is the classical solution due to Cramer (1750).
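A minimal sketch of the rule (11.8.2) follows, assuming a small system chosen for the illustration; the function names `det` and `cramer` are hypothetical. For each $k$, the $k$-th column of the matrix is replaced by the second members and the resulting determinant is divided by $A$.

```python
from fractions import Fraction

def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** k * m[0][k]
               * det([row[:k] + row[k + 1:] for row in m[1:]])
               for k in range(len(m)))

def cramer(a, h):
    """Solve a x = h for a non-singular square integer matrix a."""
    d = det(a)
    if d == 0:
        raise ValueError("matrix is singular")
    solution = []
    for k in range(len(a)):
        # Replace the k-th column by the second members h_1, ..., h_n.
        replaced = [row[:k] + [h[i]] + row[k + 1:] for i, row in enumerate(a)]
        solution.append(Fraction(det(replaced), d))
    return solution

a = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]
h = [3, 5, 5]
x = cramer(a, h)
```

The rule requires $n + 1$ determinants and is mainly of theoretical interest; elimination methods are preferable for systems of any size.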

Consider now the homogeneous system

(11.8.3)
    $a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n = 0,$
    $\cdots\cdots\cdots$
    $a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n = 0,$

or in matrix notation $\mathbf{A}\mathbf{x} = \mathbf{0}$, where $m$ is not necessarily equal to $n$.
By 11.6, the matrix $\mathbf{A}$ is of rank $r \leq n$. If $r = n$, the system (11.8.3)
has only the trivial solution $\mathbf{x} = \mathbf{0}$. On the other hand, if $r < n$, it
is possible to find $n - r$ linearly independent vectors $\mathbf{c}_1, \ldots, \mathbf{c}_{n-r}$
such that the general solution of (11.8.3) may be written in the form
$\mathbf{x} = t_1 \mathbf{c}_1 + \cdots + t_{n-r} \mathbf{c}_{n-r}$, where the $t_i$ are arbitrary constants.


11.9. Orthogonal matrices. Characteristic numbers. An orthogonal
matrix is a square matrix $\mathbf{C} = \{c_{ik}\}$ such that $\mathbf{C}\mathbf{C}' = \mathbf{I}$. Hence
$C^2 = 1$, so that the determinant $C = |\mathbf{C}| = \pm 1$. Obviously the transpose
$\mathbf{C}'$ of an orthogonal $\mathbf{C}$ is itself orthogonal. Further $\mathbf{C}^{-1} = \mathbf{C}'$,
and thus by the definition of the reciprocal matrix $C_{ik} = C\, c_{ik}$ for all
$i$ and $k$, and hence by the identities (11.5.1) and (11.5.2)

(11.9.1)    $\sum_{j=1}^{n} c_{ij} c_{kj} = \begin{cases} 1 & \text{for } i = k, \\ 0 & \text{for } i \neq k, \end{cases}$

(11.9.2)    $\sum_{j=1}^{n} c_{ji} c_{jk} = \begin{cases} 1 & \text{for } i = k, \\ 0 & \text{for } i \neq k. \end{cases}$

The product $\mathbf{C}_1\mathbf{C}_2$ of two orthogonal matrices of the same order is
itself orthogonal. If any number $p < n$ of rows $c_{i1}, c_{i2}, \ldots, c_{in}$
$(i = 1, 2, \ldots, p)$ are given, such that the relations (11.9.1) are satisfied,
we can always find $n - p$ further rows such that the resulting matrix
of order $n \cdot n$ is orthogonal. The same holds, of course, for columns.

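The linear transformation $\mathbf{x} = \mathbf{C}\mathbf{y}$, where $\mathbf{C}$ is orthogonal, is called
an orthogonal transformation. The quadratic form $\mathbf{x}'\mathbf{x} = x_1^2 + \cdots + x_n^2$ is
invariant under this transformation, i.e. it is transformed into the
form $\mathbf{y}'\mathbf{C}'\mathbf{C}\mathbf{y} = \mathbf{y}'\mathbf{y} = y_1^2 + \cdots + y_n^2$, which has the same matrix $\mathbf{I}$.
The reciprocal transformation $\mathbf{y} = \mathbf{C}^{-1}\mathbf{x}$ is also orthogonal, since
$\mathbf{C}^{-1} = \mathbf{C}'$ is orthogonal.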

The orthogonal transformations have an important geometrical 
significance. In fact, any orthogonal transformation may be regarded 
as the analytical expression of the transformation of coordinates in 
an euclidean space of n dimensions which is effected by a rotation of 
a rectangular system of coordinate axes about a fixed origin. The
distance $(x_1^2 + \cdots + x_n^2)^{1/2}$ from the origin to the point $(x_1, \ldots, x_n)$ is
invariant under any such rotation.
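For $n = 2$ the statement is easy to verify numerically. The sketch below assumes an arbitrary rotation angle and an arbitrary point (both chosen only for illustration) and checks that $\mathbf{C}\mathbf{C}' = \mathbf{I}$, that $|\mathbf{C}| = 1$, and that the distance from the origin is unchanged by $\mathbf{x} = \mathbf{C}\mathbf{y}$.

```python
import math

theta = 0.7  # arbitrary rotation angle, chosen for illustration
C = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

CCt = matmul(C, transpose(C))                 # should be the unit matrix
det_C = C[0][0] * C[1][1] - C[0][1] * C[1][0]

y = [3.0, -4.0]
x = [sum(C[i][j] * y[j] for j in range(2)) for i in range(2)]   # x = C y
norm_x = math.hypot(*x)
norm_y = math.hypot(*y)
```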

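If $\mathbf{A}$ is an arbitrary symmetric matrix, it is always possible to
find an orthogonal matrix $\mathbf{C}$ such that the product $\mathbf{C}'\mathbf{A}\mathbf{C}$ is a
diagonal matrix:

(11.9.3)    $\mathbf{C}'\mathbf{A}\mathbf{C} = \mathbf{K} = \begin{pmatrix} \kappa_1 & 0 & \cdots & 0 \\ 0 & \kappa_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \kappa_n \end{pmatrix}.$

Any other orthogonal matrix satisfying the same condition yields the
same diagonal elements $\kappa_1, \ldots, \kappa_n$, though possibly in another
arrangement. The numbers $\kappa_1, \ldots, \kappa_n$, which thus depend only on the matrix
$\mathbf{A}$, are called the characteristic numbers of $\mathbf{A}$. They are the $n$ roots
of the secular equation

(11.9.4)    $|\mathbf{A} - \kappa\mathbf{I}| = \begin{vmatrix} a_{11} - \kappa & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} - \kappa & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} - \kappa \end{vmatrix} = 0,$

and are all real. Since $\mathbf{C}$ is non-singular, $\mathbf{A}$ and $\mathbf{K}$ have the same
rank (cf 11.6). Hence the rank of $\mathbf{A}$ is equal to the number of the
roots $\kappa_i$ which are not zero. From (11.9.3) we obtain, taking the
determinants on both sides and paying regard to the relation $C^2 = 1$,

(11.9.5)    $A = \kappa_1 \kappa_2 \cdots \kappa_n.$

If $\mathbf{A}$ is non-singular, the identity

(11.9.6)    $|\mathbf{A}^{-1} - \kappa\mathbf{I}| = (-\kappa)^n A^{-1} \left|\mathbf{A} - \frac{1}{\kappa}\mathbf{I}\right|$

shows that the characteristic numbers of $\mathbf{A}^{-1}$ are the reciprocals of
the characteristic numbers of $\mathbf{A}$.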

Finally, let $\mathbf{B}$ be a matrix of order $m \cdot n$, where $m \leq n$. If $\mathbf{B}$ is of
rank $m$, the symmetric matrix $\mathbf{B}\mathbf{B}'$ of order $m \cdot m$ has all its
characteristic numbers positive. It follows, in particular, that $\mathbf{B}\mathbf{B}'$ is
non-singular. This is proved without difficulty if, in (11.9.3), we take
$\mathbf{A} = \mathbf{B}\mathbf{B}'$ and express an arbitrary characteristic number $\kappa_i$ by means of
the multiplication rule.
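For a symmetric matrix of order 2 the secular equation (11.9.4) is a quadratic, so the relations (11.9.5) and (11.9.6) can be checked directly. The following sketch assumes an example matrix with elements $a_{11} = 2$, $a_{12} = a_{21} = 1$, $a_{22} = 3$; the function name `characteristic_numbers` is hypothetical.

```python
import math

def characteristic_numbers(a, b, d):
    """Roots of the secular equation for the symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    return mean - radius, mean + radius

a, b, d = 2.0, 1.0, 3.0
k1, k2 = characteristic_numbers(a, b, d)
determinant = a * d - b * b

# Reciprocal matrix: (1 / determinant) * [[d, -b], [-b, a]]
r1, r2 = characteristic_numbers(d / determinant, -b / determinant, a / determinant)
```

Here `k1 * k2` reproduces the determinant, as in (11.9.5), and `r1`, `r2` are the reciprocals of `k2`, `k1`, as stated after (11.9.6).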

11.10. Non-negative quadratic forms. If, for all real values of
the variables $x_1, \ldots, x_n$, we have

    $Q(x_1, \ldots, x_n) = \sum_{i,k=1}^{n} a_{ik} x_i x_k \geq 0,$

where $a_{ki} = a_{ik}$, the form $Q$ will be called a non-negative quadratic
form. If, in addition, the sign of equality in the last relation holds
only when all the $x_i$ are equal to zero, we shall say that $Q$ is definite
positive. A form $Q$ which is non-negative without being definite positive
will be called semi-definite positive. Each of the properties of being
non-negative, definite positive or semi-definite positive is obviously
invariant under any non-singular linear transformation.

The symmetric matrix $\mathbf{A} = \{a_{ik}\}$ will be called non-negative, definite
positive or semi-definite positive, according as the corresponding
quadratic form $Q = \mathbf{x}'\mathbf{A}\mathbf{x}$ has these properties.

The orthogonal transformation $\mathbf{x} = \mathbf{C}\mathbf{y}$, where $\mathbf{C}$ is the orthogonal
matrix occurring in the special transformation (11.9.3), changes the
form $Q$ into a form containing only quadratic terms:

(11.10.1)    $Q(x_1, \ldots, x_n) = \kappa_1 y_1^2 + \kappa_2 y_2^2 + \cdots + \kappa_n y_n^2,$

or in matrix notation $\mathbf{x}'\mathbf{A}\mathbf{x} = \mathbf{y}'\mathbf{K}\mathbf{y}$, where the $\kappa_i$ are the characteristic
numbers of $\mathbf{A}$, while $\mathbf{K}$ is the corresponding diagonal matrix occurring
in (11.9.3). By the same orthogonal transformation, the form
$Q - \kappa(x_1^2 + \cdots + x_n^2)$ is transformed into $(\kappa_1 - \kappa) y_1^2 + \cdots + (\kappa_n - \kappa) y_n^2$.
If $\kappa$ is at most equal to the smallest characteristic number of $\mathbf{A}$, the
last form is obviously non-negative, and it follows that the form
$Q - \kappa(x_1^2 + \cdots + x_n^2)$, with the matrix $\mathbf{A} - \kappa\mathbf{I}$, has the same property.

If the form $Q$ is definite positive, the form in the second member
of (11.10.1) has the same property, and it follows that in this case
all the characteristic numbers $\kappa_i$ are positive. Hence by (11.9.5) we
have $A > 0$, so that $\mathbf{A}$ is non-singular.

If, on the other hand, $Q$ is semi-definite positive, the same argument
shows that at least one of the characteristic numbers is zero,
so that $A = 0$. If $Q$ is of rank $r$, there are exactly $r$ positive
characteristic numbers, while the $n - r$ others are equal to zero. In this
case, there are exactly $n - r$ linearly independent vectors $\mathbf{x}_p = (x_1^{(p)}, \ldots, x_n^{(p)})$
such that $Q(\mathbf{x}_p) = 0$.

The geometrical significance of the orthogonal transformation
considered above is that, by a suitable rotation of the coordinate
system, the quadric $Q(x_1, \ldots, x_n) = \text{const.}$ is referred to its principal
axes. If $Q$ is definite positive, the equation $Q = \text{const.}$ represents an
ellipsoid in $n$ dimensions, with semi-axes proportional to $\kappa_i^{-1/2}$. For
semi-definite forms $Q$, we obtain various classes of elliptic cylinders.

If $Q$ is definite positive, any form obtained by putting one or
more of the $x_i$ equal to zero must be definite positive. Hence any
principal minor of $Q$ is positive. For a semi-definite positive $Q$, the
same argument shows that any principal minor is non-negative. It
follows in particular that if, in a non-negative form $Q$, the quadratic
term $x_i^2$ does not occur, then $Q$ must be wholly independent of $x_i$.
Otherwise, in fact, the principal minor $a_{ii} a_{kk} - a_{ik}^2 = -a_{ik}^2$ would be
negative for some $k$. Conversely, if the quantities
$A, A_{11}, A_{11.22}, \ldots, A_{11.22 \ldots n-1,n-1}$ are all positive, $Q$ is definite positive.

The substitution $\mathbf{x} = \mathbf{A}^{-1}\mathbf{y}$ changes the form $Q = \mathbf{x}'\mathbf{A}\mathbf{x}$ into the
reciprocal form $Q^{-1} = \mathbf{y}'\mathbf{A}^{-1}\mathbf{y}$. Thus if $Q$ is definite positive, so is
$Q^{-1}$, and conversely. This can also be seen directly from (11.9.6).
Consider now the relation (11.5.3) for a definite positive symmetric
matrix $\mathbf{A}$. Since any principal submatrix of $\mathbf{A}$ is also definite positive,
it follows that the last term in the second member of (11.5.3) is a
definite positive quadratic form in the variables $a_{12}, \ldots, a_{1n}$, so that
we have $0 < A \leq a_{11} A_{11}$, and generally

(11.10.2)    $0 < A \leq a_{ii} A_{ii} \qquad (i = 1, 2, \ldots, n).$

By repeated application of the same argument we obtain

(11.10.3)    $0 < A \leq a_{11} a_{22} \cdots a_{nn}.$

The sign of equality holds here only when $\mathbf{A}$ is a diagonal matrix.
For a general non-negative matrix, the relation (11.10.3) holds, of
course, if we replace the sign $<$ by $\leq$.
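The determinant criterion for definite positive forms and the inequality (11.10.3) can be tried out on a small symmetric matrix. The matrix below is an assumption chosen for the illustration, and the helper names are hypothetical; `trailing_minors` returns $A, A_{11}, A_{11.22}, \ldots$, i.e. the determinants that remain after deleting the first $j$ rows and columns.

```python
from fractions import Fraction

def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** k * m[0][k]
               * det([row[:k] + row[k + 1:] for row in m[1:]])
               for k in range(len(m)))

def trailing_minors(m):
    """A, A_11, A_11.22, ...: determinants after deleting the first j rows and columns."""
    return [det([row[j:] for row in m[j:]]) for j in range(len(m))]

A = [[Fraction(v) for v in row] for row in [[4, 2, 1], [2, 3, 1], [1, 1, 2]]]
minors = trailing_minors(A)
is_definite_positive = all(v > 0 for v in minors)
diagonal_product = A[0][0] * A[1][1] * A[2][2]   # right-hand side of (11.10.3)
```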


11.11. Decomposition of $\sum_1^n x_i^2$. In certain statistical applications
we are concerned with various relations of the type

(11.11.1)    $\sum_{1}^{n} x_i^2 = Q_1 + \cdots + Q_k,$

where $Q_i$ is, for $i = 1, 2, \ldots, k$, a non-negative quadratic form in
$x_1, \ldots, x_n$ of rank $r_i$.

Consider first the particular case $k = 2$, and suppose that there
exists an orthogonal transformation changing $Q_1$ into a sum of $r_1$
squares: $Q_1 = \sum_1^{r_1} y_i^2$. Applying this transformation to both sides of
(11.11.1), the left-hand side becomes $\sum_1^n y_i^2$, and it follows that $Q_2$ is
changed into $\sum_{r_1+1}^{n} y_i^2$. Thus the rank of $Q_2$ is $r_2 = n - r_1$, and all its
characteristic numbers are 0 or 1. As an example, we consider the
identity

(11.11.2)    $\sum_{1}^{n} x_i^2 = n\bar{x}^2 + \sum_{1}^{n} (x_i - \bar{x})^2,$

where $\bar{x} = \frac{1}{n}\sum_1^n x_i$. Any orthogonal transformation $\mathbf{y} = \mathbf{C}\mathbf{x}$ such that
the first row of $\mathbf{C}$ is $\frac{1}{\sqrt{n}}, \ldots, \frac{1}{\sqrt{n}}$ will change the form
$n\bar{x}^2 = \left(\frac{x_1}{\sqrt{n}} + \cdots + \frac{x_n}{\sqrt{n}}\right)^2$ into $y_1^2$. Thus the same transformation
changes $\sum_1^n (x_i - \bar{x})^2$ into $\sum_2^n y_i^2$. In the decomposition of $\sum_1^n x_i^2$
according to (11.11.2), the two terms in the second member are thus of
ranks 1 and $n - 1$ respectively.
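The example (11.11.2) can be made concrete with one particular orthogonal matrix whose first row is $(1/\sqrt{n}, \ldots, 1/\sqrt{n})$. The sketch below uses the so-called Helmert matrix, which is one possible choice and is not named in the text; the function name `helmert` and the sample values are assumptions for the illustration.

```python
import math

def helmert(n):
    """Orthogonal n x n matrix whose first row is (1/sqrt(n), ..., 1/sqrt(n))."""
    rows = [[1 / math.sqrt(n)] * n]
    for j in range(1, n):
        scale = math.sqrt(j * (j + 1))
        # j equal entries, then -j, then zeros: orthogonal to every earlier row
        rows.append([1 / scale] * j + [-j / scale] + [0.0] * (n - j - 1))
    return rows

x = [2.0, -1.0, 4.0, 0.5, 3.5]
n = len(x)
mean = sum(x) / n
C = helmert(n)
y = [sum(c * v for c, v in zip(row, x)) for row in C]      # y = C x

total = sum(v * v for v in x)
first_term = n * mean ** 2                       # n * xbar^2, a form of rank 1
second_term = sum((v - mean) ** 2 for v in x)    # sum of squared deviations, rank n - 1
```

After the transformation, `y[0] ** 2` equals `first_term` and the remaining squares `y[1] ** 2 + ... + y[n-1] ** 2` add up to `second_term`.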



Consider now the relation (11.11.1) for an arbitrary $k > 1$. We
shall prove the following proposition due to Cochran (Ref. 66; cf
also Madow, Ref. 154):

If $\sum_1^k r_i = n$, there exists an orthogonal transformation $\mathbf{x} = \mathbf{C}\mathbf{y}$ changing
each $Q_i$ into a sum of squares according to the relations

    $Q_1 = \sum_{1}^{r_1} y_i^2, \quad Q_2 = \sum_{r_1+1}^{r_1+r_2} y_i^2, \quad \ldots, \quad Q_k = \sum_{n-r_k+1}^{n} y_i^2,$

i.e. such that no two $Q_i$ contain a common variable $y_i$.

We shall prove this theorem by induction. For $k = 1$, the truth
of the theorem is evident. We thus have to show that, if the theorem
holds for a decomposition in $k - 1$ terms, it also holds for $k$
terms. In order to show this, we first apply to (11.11.1) an orthogonal
transformation $\mathbf{x} = \mathbf{C}_1\mathbf{z}$ changing $Q_1$ into $\sum_1^{r_1} \kappa_i z_i^2$. This gives us

    $\sum_{1}^{r_1} (1 - \kappa_i) z_i^2 + \sum_{r_1+1}^{n} z_i^2 = Q_2' + \cdots + Q_k',$

where $Q_2', \ldots, Q_k'$ denote the transforms of $Q_2, \ldots, Q_k$. We now
assert that all the $\kappa_i$ are equal to 1. Suppose, in fact, that $p$ of the
$\kappa_i$ are different from 1, while the rest are equal to 1. Both members
of the last relation are quadratic forms in $z_1, \ldots, z_n$. The rank of
the first member is $n - r_1 + p$, while by 11.6 the rank of the second
member is at most equal to $r_2 + \cdots + r_k = n - r_1$. Thus $p = 0$, and
all $\kappa_i = 1$, so that we obtain

(11.11.3)    $\sum_{r_1+1}^{n} z_i^2 = Q_2' + \cdots + Q_k'.$

Here, the variables $z_1, \ldots, z_{r_1}$ do not occur in the first member, and
we shall now show that these variables do not occur in any term in
the second member. If, e.g., $Q_2'$ would not be independent of $z_1$, then
by the preceding paragraph $Q_2'$ must contain a term $c z_1^2$ with $c > 0$.
Since the coefficients of $z_1^2$ in $Q_3', \ldots, Q_k'$ are certainly non-negative,
this would, however, imply a contradiction with (11.11.3).

Thus (11.11.3) gives a representation of $\sum_{r_1+1}^{n} z_i^2$ as a sum of $k - 1$
non-negative forms in $z_{r_1+1}, \ldots, z_n$. By hypothesis the Cochran theorem
holds for this decomposition. Thus there exists an orthogonal
transformation in $n - r_1$ variables, replacing $z_{r_1+1}, \ldots, z_n$ by new
variables $y_{r_1+1}, \ldots, y_n$ such that

(11.11.4)    $Q_2' = \sum_{r_1+1}^{r_1+r_2} y_i^2, \quad \ldots, \quad Q_k' = \sum_{n-r_k+1}^{n} y_i^2.$

If we complete this transformation by the $r_1$ equations $z_1 = y_1, \ldots,$
$z_{r_1} = y_{r_1}$, we obtain an orthogonal transformation in $n$ variables, $\mathbf{z} = \mathbf{C}_2\mathbf{y}$,
such that (11.11.4) holds.

The result of performing successively the transformations $\mathbf{x} = \mathbf{C}_1\mathbf{z}$
and $\mathbf{z} = \mathbf{C}_2\mathbf{y}$ will be a composed transformation $\mathbf{x} = \mathbf{C}_1\mathbf{C}_2\mathbf{y}$ which is
orthogonal, since the product of two orthogonal matrices is itself
orthogonal. This transformation has all the required properties, and
thus the theorem is proved.

Let us remark that if, in (11.11.1), we only know that every $Q_i$ is
non-negative and that the rank of $Q_i$ is at most equal to $r_i$, where
$\sum_1^k r_i = n$, we can at once infer that $Q_i$ is effectively of rank $r_i$, so that
the conditions of the Cochran theorem are satisfied. In fact, since
the rank of a sum of quadratic forms is at most equal to the sum
of the ranks, we have, denoting by $r_i'$ the rank of $Q_i$,

    $n \leq \sum_{1}^{k} r_i' \leq \sum_{1}^{k} r_i = n.$

Thus $\sum r_i' = \sum r_i$, and $r_i' \leq r_i$. This evidently implies $r_i' = r_i$ for all $i$.

We finally remark that the Cochran theorem evidently holds true if,
in (11.11.1), the first member is replaced by a quadratic form $Q$ in any
number of variables which, by an orthogonal transformation, may be
transformed into $\sum_1^n x_i^2$.


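11.12. Some integral formulae. We shall first prove the important
formula

(11.12.1a)    $\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{i\mathbf{t}'\mathbf{x} - \frac{1}{2}\mathbf{x}'\mathbf{A}\mathbf{x}}\, dx_1 \cdots dx_n = \frac{(2\pi)^{n/2}}{\sqrt{A}}\, e^{-\frac{1}{2}\mathbf{t}'\mathbf{A}^{-1}\mathbf{t}},$

or in ordinary notation

(11.12.1b)    $\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{i\sum_1^n t_j x_j - \frac{1}{2} Q(x_1, \ldots, x_n)}\, dx_1 \cdots dx_n = \frac{(2\pi)^{n/2}}{\sqrt{A}}\, e^{-\frac{1}{2} Q^{-1}(t_1, \ldots, t_n)},$

where $Q$ is a definite positive quadratic form of matrix $\mathbf{A}$, while
$\mathbf{t} = (t_1, \ldots, t_n)$ is a real vector. As in the preceding paragraphs, $A$ is
the determinant $|\mathbf{A}|$, while $Q^{-1}$ is the reciprocal form defined in 11.7.
For $n = 1$, the formula reduces to (10.5.2).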

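In order to prove (11.12.1a) we introduce new variables
$\mathbf{y} = (y_1, \ldots, y_n)$ by the substitution $\mathbf{x} = \mathbf{C}\mathbf{y}$, where $\mathbf{C}$ is the orthogonal
matrix of (11.9.3), so that $\mathbf{C}'\mathbf{A}\mathbf{C} = \mathbf{K}$, where $\mathbf{K}$ is the diagonal matrix
formed by the characteristic numbers $\kappa_j$ of $\mathbf{A}$. At the same time we
replace the vector $\mathbf{t}$ by a new vector $\mathbf{u} = (u_1, \ldots, u_n)$ by means of
the contragredient substitution (cf 11.7.7) $\mathbf{t} = (\mathbf{C}')^{-1}\mathbf{u}$, which in this
case reduces to $\mathbf{t} = \mathbf{C}\mathbf{u}$, since $\mathbf{C}$ is orthogonal. By (11.7.6) we then
have $\mathbf{t}'\mathbf{x} = \mathbf{u}'\mathbf{y}$. Denoting the integral in the first member of (11.12.1a)
by $J$, we then obtain, since $C = \pm 1$,

    $J = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{i\mathbf{u}'\mathbf{y} - \frac{1}{2}\mathbf{y}'\mathbf{K}\mathbf{y}}\, dy_1 \cdots dy_n = \prod_{j=1}^{n} \int_{-\infty}^{\infty} e^{i u_j y_j - \frac{1}{2}\kappa_j y_j^2}\, dy_j.$

Applying (10.5.2) to every factor of the last expression, we obtain

    $J = \frac{(2\pi)^{n/2}}{\sqrt{\kappa_1 \kappa_2 \cdots \kappa_n}}\, e^{-\frac{1}{2}\sum_1^n u_j^2/\kappa_j} = \frac{(2\pi)^{n/2}}{\sqrt{A}}\, e^{-\frac{1}{2}\mathbf{u}'\mathbf{K}^{-1}\mathbf{u}},$

since by 11.7 the diagonal matrix with the diagonal elements $1/\kappa_j$ is
identical with the reciprocal $\mathbf{K}^{-1}$, while by (11.9.5) we have
$A = \kappa_1 \kappa_2 \cdots \kappa_n$. We have, however, $\mathbf{K}^{-1} = (\mathbf{C}'\mathbf{A}\mathbf{C})^{-1} = \mathbf{C}^{-1}\mathbf{A}^{-1}(\mathbf{C}')^{-1}$
$= \mathbf{C}'\mathbf{A}^{-1}\mathbf{C}$, since $\mathbf{C}$ is orthogonal. Hence $\mathbf{u}'\mathbf{K}^{-1}\mathbf{u} = \mathbf{u}'\mathbf{C}'\mathbf{A}^{-1}\mathbf{C}\mathbf{u}$
$= \mathbf{t}'\mathbf{A}^{-1}\mathbf{t}$, and thus finally

    $J = \frac{(2\pi)^{n/2}}{\sqrt{A}}\, e^{-\frac{1}{2}\mathbf{t}'\mathbf{A}^{-1}\mathbf{t}},$

i.e. the formula (11.12.1a). Putting in particular $\mathbf{t} = \mathbf{0}$, we obtain
the formula

(11.12.2)    $\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{-\frac{1}{2}\mathbf{x}'\mathbf{A}\mathbf{x}}\, dx_1 \cdots dx_n = \frac{(2\pi)^{n/2}}{\sqrt{A}}.$

This holds even for a matrix $\mathbf{A}$ with complex elements, provided
that the matrix formed by the real parts of the elements is definite
positive.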

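We further consider the integral

    $V = \int \cdots \int_{Q(x_1, \ldots, x_n) < c^2} dx_1 \cdots dx_n,$

which represents the $n$-dimensional "volume" of the domain bounded
by the ellipsoid $Q = c^2$. The orthogonal transformation used above,
followed by the simple substitution $y_i = \frac{c}{\sqrt{\kappa_i}} z_i$, shows that we have

    $V = \frac{c^n}{\sqrt{A}} \int \cdots \int_{\sum z_i^2 < 1} dz_1 \cdots dz_n.$

The last integral represents the volume of the $n$-dimensional "unit
sphere", and it will be shown below that its value is $\frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)}$, so that

(11.12.3)    $V = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} \cdot \frac{c^n}{\sqrt{A}}.$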
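Formula (11.12.3) is easily evaluated with a Gamma-function routine. The sketch below is an illustration only; the function name `ellipsoid_volume` and the three test cases (unit circle, unit sphere, and an ellipse with semi-axes 2 and 3) are assumptions chosen so that the results can be compared with elementary geometry.

```python
import math

def ellipsoid_volume(n, c, det_a):
    """Volume of the domain Q < c^2 in n dimensions, following (11.12.3)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * c ** n / math.sqrt(det_a)

unit_disc = ellipsoid_volume(2, 1.0, 1.0)        # area of the unit circle
unit_ball = ellipsoid_volume(3, 1.0, 1.0)        # volume of the unit sphere
# Q = x1^2/4 + x2^2/9 < 1 is an ellipse with semi-axes 2 and 3; its determinant is 1/36.
ellipse = ellipsoid_volume(2, 1.0, 1.0 / 36.0)
```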


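We shall finally require the value of the integral

    $B_{ik} = \int \cdots \int_{Q < c^2} x_i x_k\, dx_1 \cdots dx_n,$

extended over the same domain as the integral $V$. Making the same
substitutions as in the case of $V$, we find by some calculation that the
matrix $\mathbf{B}$ with the elements $B_{ik}$ is

    $\mathbf{B} = g_n \mathbf{C}\mathbf{K}^{-1}\mathbf{C}' = g_n \mathbf{A}^{-1},$

where

    $g_n = \frac{c^{n+2}}{\sqrt{A}} \int \cdots \int_{\sum z_i^2 < 1} z_1^2\, dz_1 \cdots dz_n.$

It will be shown below that we have

    $g_n = \frac{c^{n+2}}{(n+2)\sqrt{A}} \cdot \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} = \frac{c^2 V}{n + 2},$

so that

(11.12.4)    $B_{ik} = \frac{c^2 V}{n + 2} \cdot \frac{A_{ik}}{A}.$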

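The Dirichlet integrals used above:

    $j_1 = \int \cdots \int dz_1 \cdots dz_n$ and $j_2 = \int \cdots \int z_1^2\, dz_1 \cdots dz_n,$

extended over the $n$-dimensional unit sphere $\sum_1^n z_i^2 < 1$, can be calculated by means
of the transformation

    $z_1 = \cos \varphi_1,$
    $z_2 = \sin \varphi_1 \cos \varphi_2,$
    $z_3 = \sin \varphi_1 \sin \varphi_2 \cos \varphi_3,$
    $\cdots$
    $z_n = \sin \varphi_1 \sin \varphi_2 \cdots \sin \varphi_{n-1} \cos \varphi_n,$

which establishes a one-to-one correspondence between the domains $\sum z_i^2 < 1$ and
$0 < \varphi_i < \pi$ $(i = 1, 2, \ldots, n)$. The Jacobian of the transformation is
$(-1)^n (\sin \varphi_1)^n (\sin \varphi_2)^{n-1} \cdots \sin \varphi_n$. With the aid of the relation

    $\int_0^{\pi} (\sin \varphi)^n\, d\varphi = 2 \int_0^{\pi/2} (\sin \varphi)^n\, d\varphi = \sqrt{\pi}\, \frac{\Gamma\left(\frac{n+1}{2}\right)}{\Gamma\left(\frac{n}{2} + 1\right)},$

which is proved by substituting $x = \sin^2 \varphi$ and using (12.4.2), we then obtain

    $j_1 = \int_0^{\pi} (\sin \varphi_1)^n\, d\varphi_1 \cdots \int_0^{\pi} \sin \varphi_n\, d\varphi_n = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)},$

    $j_2 = \int_0^{\pi} (\sin \varphi_1)^n \cos^2 \varphi_1\, d\varphi_1 \int_0^{\pi} (\sin \varphi_2)^{n-1}\, d\varphi_2 \cdots \int_0^{\pi} \sin \varphi_n\, d\varphi_n = \frac{\pi^{n/2}}{(n+2)\,\Gamma\left(\frac{n}{2} + 1\right)}.$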


CHAPTER 12. 

Miscellaneous Complements. 


12.1. The symbols $O$, $o$ and $\sim$. When we are investigating
the behaviour of a function $f(x)$ as $x$ tends to zero, or infinity, or
some other specified limit, it is often desirable to compare the order
of magnitude of $f(x)$ with the order of magnitude of some known
simple function $g(x)$. In such situations we shall often use the
following notations.

1) When $\frac{f(x)}{g(x)}$ remains bounded as $x$ tends to its limit, we write
$f(x) = O(g(x))$, which may be read: "$f(x)$ is at most of the order $g(x)$".

2) When $\frac{f(x)}{g(x)}$ tends to zero, we write $f(x) = o(g(x))$, which may
be read: "$f(x)$ is of a smaller order than $g(x)$".

3) When $\frac{f(x)}{g(x)}$ tends to unity, we write $f(x) \sim g(x)$, which may be
read: "$f(x)$ is asymptotically equal to $g(x)$".

Thus as $x \to \infty$ we have e.g. $ax + b = O(x)$ and $x^n = o(e^x)$.

Symbols like $O(x)$, $o(1)$ etc. will often be used without reference
to a specified function $f(x)$. Thus e.g. $O(x)$ will stand for "any function
which is at most of order $x$", while $O(1)$ signifies "any bounded
function", and $o(1)$ "any function tending to zero".

As a further example we consider a function $f(x)$ which, in some neighbourhood
of $x = 0$, has $n$ continuous derivatives. We then have the MacLaurin expansion

    $f(x) = \sum_{\nu=0}^{n-1} \frac{f^{(\nu)}(0)}{\nu!} x^{\nu} + \frac{f^{(n)}(\theta x)}{n!} x^n,$

where $0 < \theta < 1$. Now by hypothesis $f^{(n)}(\theta x) - f^{(n)}(0)$ tends to zero with $x$.
According to the above we may thus write, as $x$ tends to zero,

    $f(x) = \sum_{\nu=0}^{n} \frac{f^{(\nu)}(0)}{\nu!} x^{\nu} + o(x^n).$

This relation, which also holds when $f(x)$ is complex, has already been used in
(10.1.3).



12.2. The Euler-MacLaurin sum formula. We define a sequence
of auxiliary functions $P_1(x), P_2(x), \ldots$ by the trigonometric expansions

(12.2.1)
    $P_{2k}(x) = \sum_{\nu=1}^{\infty} \frac{\cos 2\nu\pi x}{2^{2k-1} (\nu\pi)^{2k}},$
    $P_{2k+1}(x) = \sum_{\nu=1}^{\infty} \frac{\sin 2\nu\pi x}{2^{2k} (\nu\pi)^{2k+1}}.$

All these functions are periodical with the period 1, so that
$P_n(x + 1) = P_n(x)$.

For $n > 1$, the series representing $P_n(x)$ is absolutely and uniformly
convergent for all real $x$, so that $P_n(x)$ is bounded and continuous
over the whole interval $(-\infty, \infty)$.

The series for $P_1(x)$, on the other hand, is only conditionally
convergent, and it is well known that we have $P_1(x) = -x + \frac{1}{2}$ for
$0 < x < 1$. Denoting by $[x]$ the greatest integer $\leq x$, it follows from
the periodicity that we have for all non-integral values of $x$

    $P_1(x) = [x] - x + \tfrac{1}{2}.$

Thus every integer is a discontinuity point for $P_1(x)$, and we have
$|P_1(x)| < \frac{1}{2}$ for all $x$.

For integral values of $x$ we have

    $P_{2k}(x) = \sum_{\nu=1}^{\infty} \frac{1}{2^{2k-1} (\nu\pi)^{2k}} = (-1)^{k-1} \frac{B_{2k}}{(2k)!},$
    $P_{2k+1}(x) = 0.$

The numbers $B_\nu$ appearing here are the Bernoulli numbers defined by
the expansion

(12.2.2)    $\frac{x}{e^x - 1} = \sum_{\nu=0}^{\infty} \frac{B_\nu}{\nu!} x^{\nu}.$

We have

    $B_0 = 1, \quad B_1 = -\tfrac{1}{2}, \quad B_2 = \tfrac{1}{6}, \quad B_4 = -\tfrac{1}{30}, \quad B_6 = \tfrac{1}{42}, \quad \ldots,$

while all the $B_\nu$ of odd order $\geq 3$ are zero. For $n > 1$ we have

    $\frac{d}{dx} P_n(x) = (-1)^{n-1} P_{n-1}(x).$

For $n > 2$ this relation holds for all $x$, while for $n = 2$ its validity
is restricted to non-integral values of $x$.

Consider now a function $g(x)$ which is continuous and has a continuous
derivative $g'(x)$ for all $x$ in the closed interval $(a + n_1 h, a + n_2 h)$,
where $a$ and $h > 0$ are constants, while $n_1$ and $n_2$ are positive or
negative integers. For any integer $\nu$ such that $n_1 \leq \nu < n_2$, we then
find by partial integration

    $h \int_{\nu}^{\nu+1} P_1(x)\, g'(a + hx)\, dx = -\tfrac{1}{2} g(a + \nu h) - \tfrac{1}{2} g(a + (\nu + 1)h) + \int_{\nu}^{\nu+1} g(a + hx)\, dx.$

Hence we obtain, summing over $\nu = n_1, \ldots, n_2 - 1$,

(12.2.3)    $\sum_{\nu=n_1}^{n_2} g(a + h\nu) = \int_{n_1}^{n_2} g(a + hx)\, dx + \tfrac{1}{2} g(a + n_1 h) + \tfrac{1}{2} g(a + n_2 h) - h \int_{n_1}^{n_2} P_1(x)\, g'(a + hx)\, dx.$

This is the simplest case of the Euler-MacLaurin sum formula, which
is often very useful for the summation of series. If $g(x)$ has continuous
derivatives of higher orders, the last term can be transformed
by repeated partial integration, and we obtain the general formula

(12.2.4)    $\sum_{\nu=n_1}^{n_2} g(a + h\nu) = \int_{n_1}^{n_2} g(a + hx)\, dx + \tfrac{1}{2} g(a + n_1 h) + \tfrac{1}{2} g(a + n_2 h)$
            $- \sum_{\nu=1}^{s} \frac{B_{2\nu}}{(2\nu)!} h^{2\nu-1} \left\{ g^{(2\nu-1)}(a + n_1 h) - g^{(2\nu-1)}(a + n_2 h) \right\}$
            $+ (-1)^{s+1} h^{2s+1} \int_{n_1}^{n_2} P_{2s+1}(x)\, g^{(2s+1)}(a + hx)\, dx,$

where $s$ may be any non-negative integer, provided that all derivatives
appearing in the formula exist and are continuous.
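The effect of the correction terms in (12.2.4) can be seen numerically. The sketch below is an illustration under stated assumptions: the remainder integral is simply dropped, the integral of $g$ is approximated by a midpoint rule, and the test function $g(x) = e^{-x}$ with $a = 0$, $h = 1$, $n_1 = 0$, $n_2 = 10$ is chosen only because its derivatives are simple. The function name `euler_maclaurin` and its parameters are hypothetical.

```python
import math

def euler_maclaurin(g, derivs, a, h, n1, n2, s, steps=20000):
    """Approximate sum_{v=n1}^{n2} g(a + h v) by (12.2.4) without the remainder integral.

    derivs[k] must be the k-th derivative of g (derivs[0] is g itself).
    """
    bernoulli = {2: 1 / 6, 4: -1 / 30, 6: 1 / 42}
    # Midpoint rule for the integral of g(a + h x) over (n1, n2).
    width = (n2 - n1) / steps
    integral = width * sum(g(a + h * (n1 + (i + 0.5) * width)) for i in range(steps))
    total = integral + 0.5 * g(a + n1 * h) + 0.5 * g(a + n2 * h)
    for v in range(1, s + 1):
        d = derivs[2 * v - 1]
        total += (bernoulli[2 * v] / math.factorial(2 * v) * h ** (2 * v - 1)
                  * (d(a + n2 * h) - d(a + n1 * h)))
    return total

g = lambda x: math.exp(-x)
derivs = [g, lambda x: -math.exp(-x), g, lambda x: -math.exp(-x)]
exact = sum(g(v) for v in range(0, 11))
errors = [abs(euler_maclaurin(g, derivs, 0.0, 1.0, 0, 10, s) - exact) for s in range(3)]
```

For this example each additional Bernoulli term reduces the error by more than an order of magnitude; this is not guaranteed in general, since the series in (12.2.4) need not converge as $s$ increases.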


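If $\sum_{-\infty}^{\infty} g(a + h\nu)$ and $\int_{-\infty}^{\infty} g(a + hx)\, dx$ both converge, we obtain from
the formula (12.2.3)

(12.2.5)    $\sum_{-\infty}^{\infty} g(a + h\nu) = \int_{-\infty}^{\infty} g(a + hx)\, dx - h \int_{-\infty}^{\infty} P_1(x)\, g'(a + hx)\, dx,$

where the last integral must also converge. If, in addition, $g^{(\nu)}(x) \to 0$
as $x \to \pm\infty$ for $\nu = 1, 2, \ldots, 2s - 1$, we obtain from (12.2.4)

(12.2.6)    $\sum_{-\infty}^{\infty} g(a + h\nu) = \int_{-\infty}^{\infty} g(a + hx)\, dx + (-1)^{s+1} h^{2s+1} \int_{-\infty}^{\infty} P_{2s+1}(x)\, g^{(2s+1)}(a + hx)\, dx.$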

If, in (12.2.3), we take $g(x) = \frac{1}{x}$, $a = 0$, $h = 1$, $n_1 = 1$ and $n_2 = n$, we obtain

    $\sum_{\nu=1}^{n} \frac{1}{\nu} = \log n + \frac{1}{2} + \frac{1}{2n} + \int_1^n \frac{P_1(x)}{x^2}\, dx.$

From the definition of $P_1(x)$, it is easily seen that

    $0 < \int_n^{\infty} \frac{P_1(x)}{x^2}\, dx = O\left(\frac{1}{n^2}\right),$

so that we have

(12.2.7)    $\sum_{\nu=1}^{n} \frac{1}{\nu} = \log n + C + \frac{1}{2n} + O\left(\frac{1}{n^2}\right),$

where

    $C = \frac{1}{2} + \int_1^{\infty} \frac{P_1(x)}{x^2}\, dx = 0.5772 \ldots$

is known as Euler's constant.

12.3. The Gamma function. The Gamma function $\Gamma(p)$ is defined
for all real $p > 0$ by the integral

(12.3.1)    $\Gamma(p) = \int_0^{\infty} x^{p-1} e^{-x}\, dx.$

By 7.3, the function is continuous and has continuous derivatives of
all orders:

    $\Gamma^{(\nu)}(p) = \int_0^{\infty} x^{p-1} (\log x)^{\nu} e^{-x}\, dx$

for any $p > 0$. When $p$ tends to 0 or to $+\infty$, $\Gamma(p)$ tends to $+\infty$.
Since the second derivative is always positive, $\Gamma(p)$ has one single
minimum in $(0, \infty)$. Approximate calculation shows that the minimum is
situated in the point $p_0 = 1.4616$, where the function assumes the
value $\Gamma(p_0) = 0.8856$.
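The location of the minimum of $\Gamma(p)$ quoted above can be reproduced with a simple search. The sketch below is one possible approach and is not the calculation used in the text: it applies a golden-section search to `math.gamma` on the interval (1, 2), relying on the fact that the second derivative of $\Gamma(p)$ is positive, so that the function has a single minimum there. The function name `argmin_golden` is hypothetical.

```python
import math

def argmin_golden(f, lo, hi, tol=1e-10):
    """Golden-section search for the minimum of a unimodal function on (lo, hi)."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if f(c) < f(d):
            b = d      # the minimum lies in (a, d)
        else:
            a = c      # the minimum lies in (c, b)
    return (a + b) / 2

p0 = argmin_golden(math.gamma, 1.0, 2.0)
gamma_min = math.gamma(p0)
```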


By a partial integration, we obtain from (12.3.1) for any $p > 0$

    $\Gamma(p + 1) = p\, \Gamma(p).$

When $p$ is equal to a positive integer $n$, a repeated use of the last
equality gives, since $\Gamma(1) = 1$,

    $\Gamma(n + 1) = n!$

From (12.3.1) we further obtain the relation

(12.3.2)    $\int_0^{\infty} x^{\lambda-1} e^{-\alpha x}\, dx = \frac{\Gamma(\lambda)}{\alpha^{\lambda}},$

where $\alpha > 0$, $\lambda > 0$. If we replace here $\alpha$ by $\alpha - it$ and develop the
factor $e^{itx}$ in series, it can be shown that the last relation holds
true for complex values of $\alpha$, provided that the real part of $\alpha$ is
positive. 1)

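By (12.3.2), the function

(12.3.3)    $f(x; \alpha, \lambda) = \begin{cases} \dfrac{\alpha^{\lambda}}{\Gamma(\lambda)}\, x^{\lambda-1} e^{-\alpha x} & \text{for } x > 0, \\ 0 & \text{for } x \leq 0, \end{cases}$

has, with respect to the variable $x$, the fundamental properties of a
frequency function (cf 6.6): the function is always non-negative, and
its integral over $(-\infty, \infty)$ is equal to 1. The corresponding distribution
plays an important role in the applications (cf e.g. 18.1 and
19.4). It has the characteristic function

(12.3.4)    $\int_{-\infty}^{\infty} e^{itx} f(x; \alpha, \lambda)\, dx = \frac{\alpha^{\lambda}}{\Gamma(\lambda)} \int_0^{\infty} x^{\lambda-1} e^{-(\alpha - it)x}\, dx = \frac{\alpha^{\lambda}\, \Gamma(\lambda)}{\Gamma(\lambda)\, (\alpha - it)^{\lambda}} = \frac{1}{\left(1 - \frac{it}{\alpha}\right)^{\lambda}}.$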


1) A reader acquainted with Cauchy's theorem on complex integration will be
able to deduce the validity of (12.3.2) for complex $\alpha$ by a simple application of that
theorem.

12.4. The Beta function. The Beta function $B(p, q)$ is defined
for all real $p > 0$, $q > 0$ by the integral

(12.4.1)    $B(p, q) = \int_0^1 x^{p-1} (1 - x)^{q-1}\, dx.$

0 


We shall prove the important relation

(12.4.2)    $B(p, q) = \frac{\Gamma(p)\, \Gamma(q)}{\Gamma(p + q)}.$

The integral

    $\int_0^{\infty} t^{p+q-1} x^{p-1} e^{-t(1+x)}\, dx = \Gamma(p)\, t^{q-1} e^{-t},$

regarded as a function of the parameter $t$, satisfies the conditions of
the integration theorem of 7.3 for any interval $(\varepsilon, \infty)$ with $\varepsilon > 0$, so
that we have

    $\Gamma(p) \int_{\varepsilon}^{\infty} t^{q-1} e^{-t}\, dt = \int_0^{\infty} dx \int_{\varepsilon}^{\infty} t^{p+q-1} x^{p-1} e^{-t(1+x)}\, dt.$

When $\varepsilon$ tends to zero, the first member tends to $\Gamma(p)\,\Gamma(q)$. In the
second member, the integral with respect to $t$ tends increasingly to the
limit $\Gamma(p + q)\, \frac{x^{p-1}}{(1 + x)^{p+q}}$, which is integrable with respect to $x$ over
$(0, \infty)$. According to (5.5.2) we then obtain

    $\Gamma(p)\, \Gamma(q) = \Gamma(p + q) \int_0^{\infty} \frac{x^{p-1}}{(1 + x)^{p+q}}\, dx.$

Introducing the new variable $y = \frac{x}{1 + x}$ in the integral, we obtain the
relation (12.4.2).
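A rough numerical check of (12.4.2) is sketched below. It compares a midpoint-rule approximation of the integral (12.4.1) with the quotient of Gamma functions; the values $p = 2.5$, $q = 3$, the number of steps, and the function names are assumptions made for the illustration, and the simple quadrature is only adequate when $p$ and $q$ are at least 1, so that the integrand is bounded.

```python
import math

def beta_integral(p, q, steps=200000):
    """Midpoint-rule value of the integral (12.4.1); intended for p, q >= 1."""
    width = 1.0 / steps
    return width * sum(((i + 0.5) * width) ** (p - 1) * (1 - (i + 0.5) * width) ** (q - 1)
                       for i in range(steps))

def beta_gamma(p, q):
    """Right-hand side of (12.4.2)."""
    return math.gamma(p) * math.gamma(q) / math.gamma(p + q)

p, q = 2.5, 3.0
numeric = beta_integral(p, q)
closed_form = beta_gamma(p, q)
```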

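Taking in particular $p = q$ in (12.4.2) we obtain, introducing the
new variable $y = 2x - 1$,

(12.4.3)    $\frac{\Gamma^2(p)}{\Gamma(2p)} = \int_0^1 x^{p-1} (1 - x)^{p-1}\, dx = 2^{2-2p} \int_0^1 (1 - y^2)^{p-1}\, dy.$

For $p = \frac{1}{2}$ this gives

    $\Gamma^2(\tfrac{1}{2}) = 2 \int_0^1 \frac{dy}{\sqrt{1 - y^2}} = \pi, \qquad \Gamma(\tfrac{1}{2}) = \sqrt{\pi}.$

On the other hand, putting in (12.4.3) $y^2 = z$, we obtain

    $\frac{\Gamma^2(p)}{\Gamma(2p)} = 2^{1-2p} \int_0^1 (1 - z)^{p-1} z^{-\frac{1}{2}}\, dz = 2^{1-2p}\, \frac{\Gamma(p)\, \Gamma(\frac{1}{2})}{\Gamma(p + \frac{1}{2})},$

(12.4.4)    $\Gamma(2p) = \frac{2^{2p-1}}{\sqrt{\pi}}\, \Gamma(p)\, \Gamma(p + \tfrac{1}{2}).$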


If we define a function $\beta(x; p, q)$ by the relation

(12.4.5)    $\beta(x; p, q) = \frac{\Gamma(p + q)}{\Gamma(p)\, \Gamma(q)}\, x^{p-1} (1 - x)^{q-1}$

for $0 < x < 1$, and put $\beta(x; p, q) = 0$ outside that interval, it follows
from (12.4.1) and (12.4.2) that this function has the fundamental
properties of a frequency function. The corresponding distribution, which
has its total mass confined to the interval (0, 1), will be further
discussed in 18.4.


12.5. Stirling's formula. We now proceed to deduce a famous
formula due to Stirling, which gives an asymptotic expression for $\Gamma(p)$
when $p$ is large. We shall first prove the relation

(12.5.1)    $\Gamma(p) = \lim_{n \to \infty} \frac{n!\, n^p}{p(p + 1) \cdots (p + n)}$

for any $p > 0$.

By repeated partial integration we obtain

    $\int_0^n x^{p-1} \left(1 - \frac{x}{n}\right)^n dx = \frac{n!\, n^p}{p(p + 1) \cdots (p + n)}.$

The first member of this relation may be written as $\int_0^{\infty} g(x, n)\, dx$, where
$g(x, n) = x^{p-1} \left(1 - \frac{x}{n}\right)^n$ for $0 < x < n$, and $g(x, n) = 0$ for $x \geq n$. As $n$
tends to infinity, $g(x, n)$ tends to $x^{p-1} e^{-x}$ for every $x > 0$, and it is
easily seen that we always have $0 \leq g(x, n) < x^{p-1} e^{-x}$. Hence by
(5.5.2) we obtain (12.5.1).

It follows from (12.5.1) that $\log \Gamma(p) = \lim_{n \to \infty} S_n$, where

    $S_n = p \log n + \sum_{\nu=1}^{n} \log \nu - \sum_{\nu=0}^{n} \log (p + \nu).$

Applying the Euler-MacLaurin formula (12.2.3) to both sums in the
last expression, we obtain after some reductions

    $S_n = (p - \tfrac{1}{2}) \log p - (p + n + \tfrac{1}{2}) \log \left(1 + \frac{p}{n}\right) + 1 - \int_1^n \frac{P_1(x)}{x}\, dx + \int_0^n \frac{P_1(x)}{p + x}\, dx.$

As $n$ tends to infinity, the second term on the right-hand side tends
to $-p$, while the two integrals are convergent (though not absolutely),
owing to the fluctuations of sign of $P_1(x)$. Thus we obtain

(12.5.2)    $\log \Gamma(p) = (p - \tfrac{1}{2}) \log p - p + k + R(p),$

where $k$ is a constant, and the remainder term $R(p)$ has the expression

    $R(p) = \int_0^{\infty} \frac{P_1(x)}{p + x}\, dx.$


This integral may be transformed by repeated partial integration, as
shown in (12.2.4), and we obtain in this way

    $R(p) = \sum_{\nu=1}^{s} \frac{B_{2\nu}}{2\nu (2\nu - 1)\, p^{2\nu-1}} + (-1)^s (2s)! \int_0^{\infty} \frac{P_{2s+1}(x)}{(p + x)^{2s+1}}\, dx$

for $s = 0, 1, 2, \ldots$ For any $s > 0$, the integral appearing here is
absolutely convergent, and its modulus is smaller than

    $A \int_0^{\infty} \frac{dx}{(p + x)^{2s+1}} = \frac{A}{2s\, p^{2s}},$

where $A$ is a constant. It follows in particular that $R(p) \to 0$ as
$p \to \infty$.

In order to find the value of the constant $k$ in (12.5.2), we observe
that by (12.4.4) we have

    $\log \Gamma(2p) = \log \Gamma(p) + \log \Gamma(p + \tfrac{1}{2}) + (2p - 1) \log 2 - \tfrac{1}{2} \log \pi.$

Substituting here for the $\Gamma$-functions their expressions obtained from
(12.5.2), and allowing $p$ to tend to infinity, we find after some reductions
$k = \frac{1}{2} \log 2\pi$.

We have thus proved the Stirling formula:

(12.5.3)    $\log \Gamma(p) = (p - \tfrac{1}{2}) \log p - p + \tfrac{1}{2} \log 2\pi + R(p),$

where

    $R(p) = \int_0^{\infty} \frac{P_1(x)}{p + x}\, dx = \frac{1}{12p} + O\left(\frac{1}{p^3}\right) = \frac{1}{12p} - \frac{1}{360p^3} + O\left(\frac{1}{p^5}\right).$
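The quality of the approximation (12.5.3) is easily inspected numerically. The sketch below compares the formula, with $R(p)$ replaced by zero, one, or two terms of its expansion, against the library value `math.lgamma(p)`; the function name `log_gamma_stirling`, its `terms` parameter, and the choice $p = 10$ are assumptions made for the illustration.

```python
import math

def log_gamma_stirling(p, terms=2):
    """Stirling's formula (12.5.3), with R(p) replaced by its first `terms` expansion terms."""
    value = (p - 0.5) * math.log(p) - p + 0.5 * math.log(2 * math.pi)
    if terms >= 1:
        value += 1 / (12 * p)
    if terms >= 2:
        value -= 1 / (360 * p ** 3)
    return value

p = 10.0
errors = [abs(log_gamma_stirling(p, t) - math.lgamma(p)) for t in range(3)]
```

The three errors are of the orders $1/p$, $1/p^3$ and $1/p^5$ respectively, in agreement with the expansion of $R(p)$.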


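From Stirling's formula, we deduce i.a. the asymptotic expression

    $n! = \Gamma(n + 1) \sim \left(\frac{n}{e}\right)^n \sqrt{2\pi n},$

and further, when $p \to \infty$ while $h$ remains fixed,

    $\frac{\Gamma(p + h)}{\Gamma(p)} \sim p^h.$

By differentiation, we obtain from Stirling's formula

(12.5.4)
    $\frac{\Gamma'(p)}{\Gamma(p)} = \log p - \frac{1}{2p} - \int_0^{\infty} \frac{P_1(x)}{(p + x)^2}\, dx,$
    $\frac{\Gamma''(p)}{\Gamma(p)} - \left(\frac{\Gamma'(p)}{\Gamma(p)}\right)^2 = \frac{1}{p} + \frac{1}{2p^2} + 2 \int_0^{\infty} \frac{P_1(x)}{(p + x)^3}\, dx.$

For $p = 1$, the first relation gives

(12.5.5)    $\Gamma'(1) = -\frac{1}{2} - \int_0^{\infty} \frac{P_1(x)}{(1 + x)^2}\, dx = -\frac{1}{2} - \int_1^{\infty} \frac{P_1(x)}{x^2}\, dx = -C,$

where $C$ is Euler's constant defined by (12.2.7). Differentiating the equation
$\Gamma(p + 1) = p\, \Gamma(p)$, we further obtain

    $\frac{\Gamma'(p + 1)}{\Gamma(p + 1)} = \frac{1}{p} + \frac{\Gamma'(p)}{\Gamma(p)},$

and hence for integral values of $p$

(12.5.6)    $\frac{\Gamma'(n)}{\Gamma(n)} = 1 + \frac{1}{2} + \cdots + \frac{1}{n - 1} - C.$

An application of the Euler-MacLaurin formula (12.2.3) gives

    $\sum_{\nu=n}^{\infty} \frac{1}{\nu^2} = \frac{1}{n} + \frac{1}{2n^2} + 2 \int_n^{\infty} \frac{P_1(x)}{x^3}\, dx.$

Taking $p = n$ in the second relation (12.5.4), we thus obtain (cf p. 123)

    $\frac{\Gamma''(n)}{\Gamma(n)} - \left(\frac{\Gamma'(n)}{\Gamma(n)}\right)^2 = \sum_{\nu=n}^{\infty} \frac{1}{\nu^2} = \frac{\pi^2}{6} - \sum_{\nu=1}^{n-1} \frac{1}{\nu^2}.$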
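The relations (12.5.5) and (12.5.6) can be checked by differentiating $\log \Gamma(p)$ numerically. The sketch below is an approximation under stated assumptions: the derivative is estimated by a central difference of `math.lgamma` with a step chosen for illustration, and the numerical value of Euler's constant is quoted to double precision rather than computed. The function names are hypothetical.

```python
import math

EULER_C = 0.5772156649015329   # Euler's constant C, quoted to double precision

def digamma_numeric(p, step=1e-5):
    """Central-difference estimate of d/dp log Gamma(p) = Gamma'(p) / Gamma(p)."""
    return (math.lgamma(p + step) - math.lgamma(p - step)) / (2 * step)

def digamma_integer(n):
    """Right-hand side of (12.5.6): 1 + 1/2 + ... + 1/(n-1) - C."""
    return sum(1 / v for v in range(1, n)) - EULER_C

n = 6
numeric = digamma_numeric(n)
closed_form = digamma_integer(n)
gamma_prime_at_1 = digamma_numeric(1.0)   # close to -C, since Gamma(1) = 1
```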

12.6. Orthogonal polynomials. Let $F(x)$ be a distribution function
with finite moments (cf 7.4) $\alpha_\nu$ of all orders. We shall say that $x_0$ is
a point of increase for $F(x)$, if $F(x_0 + h) > F(x_0 - h)$ for every $h > 0$.

Suppose first that the set of all points of increase of $F$ is infinite.
We shall then show that there exists a sequence of polynomials
$p_0(x), p_1(x), \ldots$ uniquely determined by the following conditions:

a) $p_n(x)$ is of degree $n$, and the coefficient of $x^n$ in $p_n(x)$ is positive.

b) The $p_n(x)$ satisfy the orthogonality conditions

    $\int_{-\infty}^{\infty} p_m(x)\, p_n(x)\, dF(x) = \begin{cases} 1 & \text{for } m = n, \\ 0 & \text{for } m \neq n. \end{cases}$

The $p_n(x)$ will be called the orthogonal polynomials associated with the
distribution corresponding to $F(x)$.

We first observe that for any $n \ge 0$ the quadratic form in the
$n + 1$ variables $u_0, u_1, \ldots, u_n$

$$\int_{-\infty}^{\infty} \left(u_0 + u_1 x + \cdots + u_n x^n\right)^2 dF(x) = \sum_{i,k=0}^{n} \alpha_{i+k}\,u_i u_k$$

is definite positive. For by hypothesis $F(x)$ has at least $n + 1$ points
of increase, and at least one of these must be different from all the
$n$ zeros of $u_0 + \cdots + u_n x^n$, so that the integral is always positive as
long as the $u_i$ are not all equal to zero. It follows (cf 11.10) that
the determinant of the form is positive:

$$D_n = \begin{vmatrix} \alpha_0 & \alpha_1 & \cdots & \alpha_n \\ \alpha_1 & \alpha_2 & \cdots & \alpha_{n+1} \\ \vdots & \vdots & & \vdots \\ \alpha_n & \alpha_{n+1} & \cdots & \alpha_{2n} \end{vmatrix} > 0.$$



Obviously we must have $p_0(x) = 1$. Now write

$$p_n(x) = u_0 + u_1 x + \cdots + u_n x^n,$$

where $n > 0$, and try to determine the coefficients $u_i$ from the conditions
a) and b). Since every $p_i(x)$ is to have the precise degree $i$,
any power $x^i$ can be represented as a linear combination of $p_0(x), \ldots, p_i(x)$.
It follows that we must have

$$\int_{-\infty}^{\infty} x^i\,p_n(x)\,dF(x) = 0$$

for $i = 0, 1, \ldots, n - 1$. Carrying out the integrations, we thus have
$n$ linear and homogeneous equations between the $n + 1$ unknowns
$u_0, \ldots, u_n$, and it follows that any polynomial $p_n(x)$ satisfying our
conditions must necessarily be of the form

(12.6.1)
$$p_n(x) = K \begin{vmatrix} \alpha_0 & \alpha_1 & \cdots & \alpha_n \\ \vdots & \vdots & & \vdots \\ \alpha_{n-1} & \alpha_n & \cdots & \alpha_{2n-1} \\ 1 & x & \cdots & x^n \end{vmatrix},$$

where $K$ is a constant. For $K \neq 0$, this polynomial is of precise
degree $n$, as the coefficient of $x^n$ in the determinant is $D_{n-1} > 0$.
Thus $p_n(x)$ is uniquely determined by the conditions that $\int p_n^2\,dF = 1$
and that the coefficient of $x^n$ should be positive.¹) We have thus
established the existence of a uniquely determined sequence of orthogonal
polynomials corresponding to any distribution with an infinite
number of points of increase.
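
The determinant formula can be tried out directly on a distribution whose moments are known. The sketch below is an illustration under stated assumptions: it uses the moments 1, 0, 1, 0, 3, 0, 15 of the standard normal distribution, takes the normalising constant as K = (D_{n-1} D_n)^(-1/2) as quoted in the footnote, sets D_{-1} = 1 by convention, and expands the determinant along its last row. The helper names `det`, `hankel_det` and `orthonormal_poly` are hypothetical.

```python
from fractions import Fraction
import math


def det(m):
    """Determinant by Laplace expansion along the first row (adequate for small matrices)."""
    if not m:
        return Fraction(1)          # empty determinant, used for the convention D_{-1} = 1
    if len(m) == 1:
        return m[0][0]
    total = Fraction(0)
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total


def hankel_det(alpha, n):
    """D_n = det(alpha_{i+k}) for i, k = 0..n."""
    return det([[alpha[i + k] for k in range(n + 1)] for i in range(n + 1)])


def orthonormal_poly(alpha, n):
    """Coefficients [u_0, ..., u_n] of p_n obtained from the determinant formula."""
    rows = [[alpha[i + k] for k in range(n + 1)] for i in range(n)]
    # Expanding along the last row (1, x, ..., x^n): the coefficient of x^j
    # is the cofactor of the entry in row n, column j.
    cof = [(-1) ** (n + j) * det([r[:j] + r[j + 1:] for r in rows])
           for j in range(n + 1)]
    k = 1 / math.sqrt(hankel_det(alpha, n - 1) * hankel_det(alpha, n))
    return [k * float(c) for c in cof]


# Moments alpha_0, ..., alpha_6 of the standard normal distribution (assumed input).
normal_moments = [Fraction(m) for m in (1, 0, 1, 0, 3, 0, 15)]
for n in range(4):
    print(n, orthonormal_poly(normal_moments, n))
```

Exact rational arithmetic is used for the determinants, so only the final square root introduces rounding. The Laplace expansion is chosen for brevity and becomes impractical for large n.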

If $F(x)$ has only $N$ points of increase, it easily follows from the
above proof that the $p_n(x)$ exist and are uniquely determined for
$n = 0, 1, \ldots, N - 1$. The determinants $D_n$ are in this case still positive
for $n = 0, 1, \ldots, N - 1$, but for $n \ge N$ we have $D_n = 0$.

Consider in particular the case of a distribution with a continuous 
frequency function $f(x) = F'(x)$, and let $p_0(x), p_1(x), \ldots$ be the corresponding
orthogonal polynomials. If $g(x)$ is another frequency function, we
may try to develop $g(x)$ in a series

(12.6.2)
$$g(x) = b_0\,p_0(x) f(x) + b_1\,p_1(x) f(x) + \cdots$$

¹) It can be shown that $K = (D_{n-1} D_n)^{-1/2}$. Cf e. g. Szegő, Ref. 86.

Multiply with $p_n(x)$ and suppose that we may integrate term by term.
The orthogonality relations then give

(12.6.3)
$$b_n = \int_{-\infty}^{\infty} p_n(x)\,g(x)\,dx.$$

Thus in particular $b_0 = 1$. Expansions of this type may sometimes
render good service for the analytic representation of distributions.
— We shall now give some examples of orthogonal polynomials.

1. The Hermite polynomials $H_n(x)$ are defined by the relations

(12.6.4)
$$\left(\frac{d}{dx}\right)^{n} e^{-\frac{x^2}{2}} = (-1)^n H_n(x)\,e^{-\frac{x^2}{2}} \qquad (n = 0, 1, 2, \ldots).$$

$H_n(x)$ is a polynomial of degree $n$, and we have

(12.6.5)
$$\begin{aligned}
H_0(x) &= 1, \qquad H_1(x) = x, \qquad H_2(x) = x^2 - 1, \\
H_3(x) &= x^3 - 3x, \qquad H_4(x) = x^4 - 6x^2 + 3, \\
H_5(x) &= x^5 - 10x^3 + 15x, \qquad H_6(x) = x^6 - 15x^4 + 45x^2 - 15.
\end{aligned}$$


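By repeated partial integration, we obtain the relation

(12.6.6)
$$\int_{-\infty}^{\infty} H_m(x) H_n(x)\,d\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} H_m(x) H_n(x)\,e^{-\frac{x^2}{2}}\,dx = \begin{cases} n! & \text{for } m = n, \\ 0 & \text{for } m \neq n, \end{cases}$$

which shows that $\left\{ \frac{1}{\sqrt{n!}} H_n(x) \right\}$ is the sequence of orthogonal polynomials
associated with the normal distribution defined by (10.5.3).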
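We also note the expansions

(12.6.7)
$$\sum_{\nu=0}^{\infty} \frac{H_\nu(x)}{\nu!}\,t^\nu = e^{-\frac{t^2}{2} + tx}$$

and

(12.6.8)
$$\sum_{\nu=0}^{\infty} \frac{H_\nu(x) H_\nu(y)}{\nu!}\,t^\nu = \frac{1}{\sqrt{1 - t^2}}\,e^{-\frac{t^2 x^2 + t^2 y^2 - 2txy}{2(1 - t^2)}}.$$

The first of these follows simply from the definition (12.6.4). A proof of
(12.6.8) given by Cramér will be found in Charlier, Ref. 9 a, p. 50-53.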
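
The Hermite polynomials and their orthogonality relation can be reproduced with a few lines of integer arithmetic. Differentiating the defining relation (12.6.4) once more gives H_{n+1}(x) = x H_n(x) - H_n'(x), and the sketch below uses this recurrence. The moments of the standard normal distribution (0 for odd order, 1·3·5···(k-1) for even order k) are assumed as known input, and the function names are illustrative only.

```python
import math


def hermite(n_max):
    """Coefficient lists (lowest power first) of H_0, ..., H_{n_max}."""
    polys = [[1]]
    for _ in range(n_max):
        h = polys[-1]
        x_h = [0] + h                                # coefficients of x * H_n
        dh = [k * h[k] for k in range(1, len(h))]    # coefficients of H_n'
        dh += [0] * (len(x_h) - len(dh))             # pad to equal length
        polys.append([a - b for a, b in zip(x_h, dh)])
    return polys


def normal_moment(k):
    """E[X^k] for a standard normal X: 0 for odd k, 1*3*...*(k-1) for even k."""
    return 0 if k % 2 else math.prod(range(1, k, 2))


def inner(p, q):
    """Integral of p(x) q(x) with respect to the normal distribution, from the moments."""
    return sum(a * b * normal_moment(i + j)
               for i, a in enumerate(p) for j, b in enumerate(q))


H = hermite(6)
for n, coeffs in enumerate(H):
    # inner(H_n, H_n) should equal n!, in line with the orthogonality relation.
    print(n, coeffs, inner(coeffs, coeffs))
```

Because every quantity is an integer, the orthogonality relations hold exactly in this check rather than up to rounding.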
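2. The Laguerre polynomials $L_n^{(\lambda)}(x)$ are defined by the relations

$$\left(\frac{d}{dx}\right)^{n} \left(x^{n+\lambda-1} e^{-x}\right) = (-1)^n\,n!\,L_n^{(\lambda)}(x)\,x^{\lambda-1} e^{-x},$$

which give

$$L_0^{(\lambda)}(x) = 1, \qquad L_1^{(\lambda)}(x) = x - \lambda, \qquad L_2^{(\lambda)}(x) = \frac{x^2 - 2(\lambda + 1)x + \lambda(\lambda + 1)}{2}, \ \ldots$$

By repeated partial integration we find

$$\frac{1}{\Gamma(\lambda)} \int_0^\infty L_m^{(\lambda)}(x)\,L_n^{(\lambda)}(x)\,x^{\lambda-1} e^{-x}\,dx = \begin{cases} \binom{n+\lambda-1}{n} & \text{for } m = n, \\ 0 & \text{for } m \neq n, \end{cases}$$

so that $\left\{ \binom{n+\lambda-1}{n}^{-1/2} L_n^{(\lambda)}(x) \right\}$ is the sequence of orthogonal polynomials associated with
the distribution defined by the frequency function $f(x; \alpha, \lambda)$ considered in (12.3.3),
when we take $\alpha = 1$.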
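
The defining relation of the Laguerre polynomials lends itself to a direct mechanical check. The sketch below is an illustration, not the author's method: it differentiates P(x) x^a e^(-x) repeatedly using d/dx [P x^a e^(-x)] = [x P' + (a - x) P] x^(a-1) e^(-x), and verifies orthogonality from the moments lambda(lambda+1)···(lambda+k-1) of the distribution with alpha = 1. The value lambda = 5/2 and all function names are assumptions chosen for the example.

```python
from fractions import Fraction
from math import factorial


def laguerre(n, lam):
    """Coefficients (lowest power first) of L_n^(lam), from the defining relation."""
    p = [Fraction(1)]        # polynomial factor P in P(x) * x^a * e^(-x)
    a = n + lam - 1
    for _ in range(n):
        new = [Fraction(0)] * (len(p) + 1)
        for k, c in enumerate(p):
            new[k] += (k + a) * c     # contribution of x*P'(x) + a*P(x)
            new[k + 1] -= c           # contribution of -x*P(x)
        p, a = new, a - 1
    # After n differentiations the factor is (-1)^n n! L_n^(lam).
    return [(-1) ** n * c / factorial(n) for c in p]


def gamma_moment(k, lam):
    """Moment of order k: lam (lam+1) ... (lam+k-1), for alpha = 1."""
    out = Fraction(1)
    for j in range(k):
        out *= lam + j
    return out


def inner(p, q, lam):
    """Integral of p(x) q(x) x^(lam-1) e^(-x) / Gamma(lam) over (0, infinity)."""
    return sum(a * b * gamma_moment(i + j, lam)
               for i, a in enumerate(p) for j, b in enumerate(q))


lam = Fraction(5, 2)         # arbitrary illustrative value of lambda
L = [laguerre(n, lam) for n in range(5)]
for n, coeffs in enumerate(L):
    # The second printed value should equal the binomial coefficient C(n+lam-1, n).
    print(n, inner(coeffs, coeffs, lam), gamma_moment(n, lam) / factorial(n))
```

The binomial coefficient with non-integral upper index is evaluated here as lambda(lambda+1)···(lambda+n-1)/n!, which is the same quantity written as a product.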


3. Consider the distribution obtained by placing the mass $\frac{1}{N}$ in each of the $N$
points $x_1, \ldots, x_N$. The corresponding distribution function is a step-function with
a step of height $\frac{1}{N}$ in each $x_i$. Let $p_0(x), \ldots, p_{N-1}(x)$ be the associated orthogonal
polynomials, which according to the above are uniquely determined. The orthogonality
relations then reduce to

$$\frac{1}{N} \sum_{i=1}^{N} p_m(x_i)\,p_n(x_i) = \begin{cases} 1 & \text{for } m = n, \\ 0 & \text{for } m \neq n. \end{cases}$$


These polynomials may be used with advantage e. g. in the following problem. Suppose
that we have $N$ observed points $(x_1, y_1), \ldots, (x_N, y_N)$, and want to find the parabola
$y = q(x)$ of degree $n < N$, which gives the closest fit to the observed ordinates,
in the sense of the principle of least squares, i. e. such that

$$U = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - q(x_i)\right)^2$$

becomes a minimum. We then write $q(x)$ in the form

$$q(x) = c_0\,p_0(x) + \cdots + c_n\,p_n(x),$$

and the ordinary rules for finding a minimum now immediately give

$$c_r = \frac{1}{N} \sum_{i=1}^{N} y_i\,p_r(x_i)$$

for $r = 0, 1, \ldots, n$, while the corresponding minimum value of $U$ is

$$U_{\min} = \frac{1}{N} \sum_{i=1}^{N} y_i^2 - c_0^2 - c_1^2 - \cdots - c_n^2.$$

The case when the points $x_i$ are equidistant is particularly important in the applications.
In that case, the numerical calculation of $q(x)$ and $U_{\min}$ may be performed
with a comparatively small amount of labour. Cf e. g. Esscher (Ref. 82) and Aitken
(Ref. 60). — Cf further the theory of parabolic regression in 21.6.
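
The least squares procedure just described may be sketched as follows. This is an illustration under stated assumptions: the orthonormal polynomials are built by Gram-Schmidt orthogonalisation of 1, x, ..., x^n with respect to the mass 1/N at each point (an implementation choice that yields the same polynomials as the determinant formula), only their values at the observed abscissae are stored, and the data are made up for the example. The function names are hypothetical.

```python
import math


def orthonormal_values(xs, n):
    """Values p_r(x_i), r = 0..n, of the polynomials orthonormal for mass 1/N at each x_i."""
    N = len(xs)

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v)) / N

    basis = []
    for r in range(n + 1):
        v = [x ** r for x in xs]
        for p in basis:                      # remove components along earlier polynomials
            c = dot(v, p)
            v = [a - c * b for a, b in zip(v, p)]
        norm = math.sqrt(dot(v, v))          # positive, so the leading coefficient stays positive
        basis.append([a / norm for a in v])
    return basis


def fit(xs, ys, n):
    """Return (coefficients c_0..c_n, fitted ordinates q(x_i), U_min from the closed formula)."""
    N = len(xs)
    P = orthonormal_values(xs, n)
    c = [sum(y * p for y, p in zip(ys, pr)) / N for pr in P]
    q = [sum(c[r] * P[r][i] for r in range(n + 1)) for i in range(N)]
    u_min = sum(y * y for y in ys) / N - sum(cr * cr for cr in c)
    return c, q, u_min


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]           # equidistant abscissae
ys = [1.1, 1.9, 4.2, 8.8, 17.1, 26.0]         # made-up ordinates, for illustration only
c, q, u_min = fit(xs, ys, 2)
u_direct = sum((y - qi) ** 2 for y, qi in zip(ys, q)) / len(xs)
print(u_min, u_direct)                        # the two values agree up to rounding
```

A practical advantage visible in the code is that raising the degree from n to n + 1 only adds one coefficient; the coefficients already computed remain unchanged.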




SECOND PART 

RANDOM VARIABLES 
AND PROBABILITY DISTRIBUTIONS 




Chapters 13—14. Foundations. 


CHAPTER 13. 

Statistics and Probability. 

13.1. Random experiments. — In the most varied fields of practical
and scientific activity, cases occur where certain experiments or
observations may be repeated a large number of times under similar 
circumstances. On each occasion, our attention is then directed to a 
result of the observation, which is expressed by a certain number of
characteristic features. 

In many cases these characteristics directly take a quantitative 
form: at each observation something is counted or measured. In other 
cases, the characteristics are qualitative : we observe e. g. the colour 
of a certain object, the occurrence or non-occurrence of some specified
event in connection with each experiment, etc. In the latter case, it 
is always possible to express the characteristics in numerical form 
according to some conventional system of notation. Whenever it is 
found convenient, we may thus always suppose that the result of each 
observation is expressed by a certain number of quantities. 


1. If we make a series of throws with an ordinary die, each throw yields as its 
result one of the numbers 1, 2, ..., 6.

2. If we measure the length and the weight of the body of each member of a 
group of animals belonging to the same species, every individual gives rise to an
observation, the result of which is expressed by two numbers.

3. If, in a steel factory, we take a sample from every day’s production, and 
measure its hardness, tensile strength and percentage of coal, sulphur and phosphorus, 
the result of each observation is given by five numbers. 

4. If we observe at regular time intervals the prices of k different commodities,
the result of each observation is expressed by k numbers.

5. If we observe the sex of every child born in a certain district, the result of
each observation is not directly expressed by numbers. We may, however, agree to 
denote the birth of a boy by 1, and the birth of a girl by 0, and thus conventionally 
express our results in numerical form. 




In some cases we know the phenomenon under investigation sufficiently
well to feel justified in making exact predictions with respect
to the result of each individual observation. Thus if our experiments
consist in observing, for every year, the number of eclipses of the
sun visible from a given observatory, we do not hesitate to predict,
on the strength of astronomical calculations, the exact value of this
number. A similar situation arises in every case where it is assumed 
that the laws governing the phenomena are known, and these laws are 
sufficiently simple to be used for calculations in practice. 

In the majority of cases, however, our knowledge is not precise 
enough to allow of exact predictions of the results of individual 
observations. This is the situation, e. g., in all the examples 1-5
quoted above. Even if the utmost care is taken to keep all relevant 
circumstances under control, the result may in such cases vary from 
one observation to another in an irregular way that eludes all our 
attempts at prediction. In such a case, we shall say that we are 
concerned with a sequence of random experiments. 

Any systematic record of the results of sequences of this kind will 
be said to constitute a set of statistical data relative to the pheno¬ 
menon concerned. The chief object of statistical theory is to in¬ 
vestigate the possibility of drawing valid inferences from statistical 
data, and to work out methods by which such inferences may be 
obtained. As a preliminary to the discussion of these questions, we 
shall in the two following paragraphs consider some general properties 
of random experiments. 

13.2. Examples. — It does not seem possible to give a precise
definition of what is meant by the word »random». The sense of the
word is best conveyed by some examples. 

If an ordinary coin is rapidly spun several times, and if we take 
care to keep the conditions of the experiment as uniform as possible 
in all respects, we shall find that we are unable to predict whether, 
in a particular instance, the coin will fall »heads» or »tails». If the
first throw has resulted in heads and if, in the following throw, we 
try to give the coin exactly the same initial state of motion, it will 
still appear that it is not possible to secure another case of heads. 
Even if we try to build a machine throwing the coin with perfect 
regularity, it is not likely that we shall succeed in predicting the 
results of individual throws. On the contrary, the result of the ex¬ 
periment will always fluctuate in an uncontrollable way from one 
instance to another. 




At first, this may seem rather difficult to explain. If we accept
a deterministic point of view, we must maintain that the result of 
each throw is uniquely determined by the initial state of motion of the 
coin (external conditions, such as air resistance and physical properties 
of the table, being regarded as fixed). Thus it would seem theoretic¬ 
ally possible to make an exact prediction, as soon as the initial state 
is known, and to produce any desired result by starting from an 
appropriate initial state. A moment's reflection will, however, show
that even extremely small changes in the initial state of motion must 
be expected to have a dominating influence on the result. In practice,
the initial state will never be exactly known, but only to a certain 
approximation. Similarly, when we try to establish a perfect uniformity 
of initial states during the course of a sequence of throws, we shall 
never be able to exclude small variations, the magnitude of which 
depends on the precision of the mechanism used for making the throws. 
Between the limits determined by the closeness of the approximation, 
there will always be room for various initial states, leading to both 
the possible final results of heads and tails, and thus an exact prediction
will always be practically impossible. — Similar remarks
apply to the throws with a die quoted as Ex. 1 in the preceding
paragraph, and generally to all ordinary games of chance with dice
and cards. 

According to modern biological theory, the phenomenon of heredity 
shows in important respects a striking analogy with a game of chance. 
The combinations of genes arising in the process of fertilization seem 
to be regulated by a mechanism more or less resembling the throwing 
of a coin. In a similar way as in the case of the coin, extremely 
small variations in the initial position and motion of the gametes 
may produce great differences in the properties of the offspring. 
Accordingly we find here, e. g. with respect to the sex of the offspring 
(Ex. 5 of the preceding paragraph), the same impossibility of indivi¬ 
dual prediction and the same »random fluctuations» of the results as 
in the case of the coin or the die. 

Next, let us imagine that we observe a number of men of a given 
age during a period of, say, one year, and note in each case whether
the man is alive at the end of the year or not. Let us suppose that, 
with the aid of a medical expert, we have been able to collect detailed 
information concerning health, occupation, habits etc. of each ob¬ 
served person. Nevertheless, it will obviously be impossible to make 
exact predictions with regard to the life or death of one particular
person, since the causes leading to the ultimate result are far too
numerous and too complicated to allow of any precise calculation. 
Even for an observer endowed with a much more advanced biological
knowledge than is possible at the present epoch, the practical conclusion
would be the same, owing to the multitude and complexity of
the causes at work. 

In the examples 2 and 4 of the preceding paragraph, the situation 
seems to be largely analogous to the example just discussed. The laws 
governing the phenomena are in neither case very well known, and 
even if they were known to a much greater extent than at present, 
the structure of each case is so complicated that an individual predic¬ 
tion would still seem practically impossible. Accordingly, the observa¬ 
tions show in these cases, and in numerous other cases of a similar 
nature, the same kind of random irregularity as in the previous 
examples. 

It is important to note that a similar situation may arise even in 
cases where we consider the laws of the phenomena as perfectly known, 
provided that these laws are sufficiently complicated. Consider e. g. 
the case of the eclipses of the sun mentioned in the preceding para¬ 
graph. We do assume that it is possible to predict the annual num¬ 
ber of eclipses, and if the requisite tables are available, anybody 
can undertake to make such predictions. Without the tables, however, 
it would be rather a formidable task to work out the necessary calcu¬ 
lations, and if these difficulties should be considered insurmountable, 
prediction would still be practically impossible, and the fluctuations 
in the annual number of eclipses would seem comparable to the 
fluctuations in a sequence of games of chance. 

Suppose, finally, that our observations consist in making a series 
of repeated measurements of some physical constant, the method of 
measurement and the relevant external conditions being kept as uni¬ 
form as possible during the whole series. It is well known that, in 
spite of all precautions taken by the observer, the successive measure¬ 
ments will generally yield different results. This phenomenon is com¬ 
monly ascribed to the action of a large number of small disturbing 
factors, which combine their effects to a certain total »error» affecting 
each particular measurement. The amount of this error fluctuates from
one observation to another in an irregular way that makes it impossible
to predict the result of an individual measurement. — Similar
considerations apply to cases of fluctuations of quality in manufactured
articles, such as Ex. 3 of the preceding paragraph. Small and
uncontrollable variations in the production process and in the quality
of raw materials will combine their effects and produce irregular
fluctuations in the final product.

The examples discussed above are representative of large and
important groups of random experiments. Small variations in the 
initial state of the observed units, which cannot be detected by our 
instruments, may produce considerable changes in the final result. 
The complicated character of the laws of the observed phenomena 
may render exact calculation practically, if not theoretically, impossible. 
Uncontrollable action by small disturbing factors may lead to irregular
deviations from a presumed »true value».

It is, of course, clear that there is no sharp distinction between 
these various modes of randomness. Whether we ascribe e. g. the
fluctuations observed in the results of a series of shots at a target 
mainly to small variations in the initial state of the projectile, to the 
complicated nature of the ballistic laws, or to the action of small 
disturbing factors, is largely a matter of taste. The essential thing
is that, in all cases where one or more of these circumstances are 
present, an exact prediction of the results of individual experiments 
becomes impossible, and the irregular fluctuations characteristic of 
random experiments will appear. 

We shall now see that, in cases of this character, there appears 
amidst all irregularity of fluctuations a certain typical form of regularity,
that will serve as the basis of the mathematical theory of statistics.

13.3. Statistical regularity. — We have seen that, in a sequence 
of random experiments, it is not possible to predict individual results. 
These are subject to irregular random fluctuations which cannot be 
submitted to exact calculation. However, as soon as we turn our 
attention from the individual experiments to the whole sequence of 
experiments, the situation changes completely, and an extremely im¬ 
portant phenomenon appears : In spite of the irregular behaviour of 
individual results, the average results of long sequences of random ex¬ 
periments show a striking regularity. 

In order to explain this important mode of regularity, we consider 
a determined random experiment $\mathfrak{E}$, that may be repeated a large
number of times under uniform conditions. Let $\mathfrak{S}$ denote the set of
all a priori possible different results of an individual experiment,
while $S$ denotes a fixed subset of $\mathfrak{S}$. If, in a particular experiment,
we obtain a result $\xi$ belonging to the subset $S$, we shall say that the
event defined by the relation $\xi \subset S$, or briefly the event $\xi \subset S$, has
occurred.¹) We shall often also denote an event by a single letter E,
writing $E = E(\xi \subset S)$, and we may then speak without distinction of
»the event E» or »the event $\xi \subset S$».

When our experiment $\mathfrak{E}$ consists in throwing a die, the set $\mathfrak{S}$ contains the six
numbers 1, 2, ..., 6. Let $S$ denote e. g. the subset containing the three numbers
2, 4, 6. The event $\xi \subset S$ then occurs at any throw resulting in an even number of
points.

When we are concerned with measurements of some physical constant x, the 
value of which is a priori completely unknown, it may be at least theoretically possible
for a measurement to yield as its result any real number, and accordingly the
set $\mathfrak{S}$ would then be the one-dimensional space $R_1$. Let $S$ denote e. g. the closed
interval (a, b). The event $\xi \subset S$ then occurs every time a measurement yields a value
$\xi$ belonging to (a, b).

Let us now repeat our experiment $\mathfrak{E}$ a large number of times, and
observe each time whether the event $E = E(\xi \subset S)$ takes place or not.
If we find that, among the $n$ first experiments, the event E has
occurred exactly $\nu$ times, the ratio $\nu/n$ will be called the frequency
ratio or simply the frequency of the event E in the sequence formed 
by the n first experiments. 

Now, if we observe the frequency $\nu/n$ of a fixed event E for increasing
values of n, we shall generally find that it shows a marked tendency to
become more or less constant for large values of n. 

This phenomenon is illustrated by Fig. 3, which shows the variation
of the frequency $\nu/n$ of the event »heads» within a sequence of
throws with a coin. As shown by the figure, the frequency ratio
fluctuates violently for small values of n, but gradually the amplitude
of the fluctuations becomes smaller, and the graph may suggest the
impression that, if the series of experiments could be infinitely continued
under uniform conditions, the frequency would approach some
definite ideal or limiting value very near to ½.
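
The behaviour of the frequency ratio can be imitated on a computer. The following sketch is only a simulation under explicit assumptions: a pseudo-random number generator stands in for the physical coin, the probability of »heads» is set to 1/2, and the seed and the number of throws are arbitrary. It is not a reproduction of the data behind Fig. 3.

```python
import random


def frequency_path(n_throws, p=0.5, seed=0):
    """Frequency ratio (number of heads)/n after each of n_throws simulated throws."""
    rng = random.Random(seed)     # fixed seed so that the run is repeatable
    heads = 0
    path = []
    for n in range(1, n_throws + 1):
        if rng.random() < p:      # one simulated throw; True counts as »heads»
            heads += 1
        path.append(heads / n)
    return path


path = frequency_path(100_000)
# Sample points roughly equidistant on a logarithmic scale, as in the figure.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, path[n - 1])
```

Printing the ratio at n = 10, 100, 1000, ... mirrors the logarithmic abscissa of the figure: the early values may differ noticeably from 1/2, while the later ones typically lie close to it.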

It is an old experience that this stability of frequency ratios usually
appears in long series of repeated random observations, performed
under uniform conditions. For an event of the type $\xi \subset S$ observed
in connection with such a series, we shall thus as a rule obtain a
graph of the same general character as in the particular case illustrated
by Fig. 3. Moreover, in a case where this statement is not true, a
careful examination will usually disclose some definite lack of uniformity
in the conditions of the experiments. We might thus be
tempted to advance a conjecture that, generally, a frequency of the
type here considered would approach a definite ideal value, if the
corresponding series of experiments could be infinitely continued.

¹) We assume here that $S$ is some set of simple structure, so that it may be
directly observed whether $\xi$ belongs to $S$ or not. In the following chapter, the question
will be considered from a more general point of view.

Fig. 3. Frequency ratio of »heads» in a sequence of throws with a coin. Logarithmic
scale for the abscissa.

A conjecture of this kind can, of course, neither be proved nor 
disproved by actual experience, since we can never perform an infinite 
sequence of experiments. The experiments do, however, strongly support 
the less precise conjecture that, to any event E connected with a random 
experiment $\mathfrak{E}$, we should be able to ascribe a number P such that, in a
long series of repetitions of $\mathfrak{E}$, the frequency of E would be approximately
equal to P.

This is the typical form of statistical regularity which constitutes 
the empirical basis of statistical theory. We must now attempt to 
give a precise meaning to the somewhat vague expressions used in
the above statement, and we shall further have to investigate the laws
that govern this mode of regularity, and to show how these laws may
be applied in drawing inferences from statistical data. In order to
carry out this task, we shall in the first place try to work out a
mathematical theory of phenomena showing statistical regularity. Before
attempting to do this it will, however, be convenient to give in
the following paragraph some general remarks concerning the nature
and object of any mathematical theory of a group of empirically
observed phenomena.


Historically, this remarkable behaviour of frequency ratios was first observed in
the field of games of chance, of which our example with the coin forms a particularly
simple case. Already at an early epoch, it was observed that, in all current games
with cards, dice etc., the frequency of a given result of a certain game seemed to
cluster in the neighbourhood of some definite value, when the game was repeated a
large number of times. The attempts to give a mathematical explanation of certain
observed facts of this kind became the immediate cause of the origin (about 1650)
and first development of the Mathematical Theory of Probability, under the hands of
Pascal, Fermat, Huygens and James Bernoulli. A little later, the same type of regularity
was found to occur in frequencies connected with various demographic data, and the
theory of population statistics was based on this fact. Gradually, the field of application
of statistical methods widened, and at the present time we may regard it as an
established empirical fact that the »long run stability» of frequency ratios is a general
characteristic of random experiments, performed under uniform conditions.

In some cases, especially when we are concerned with observations on individuals
from human or other biological populations, this statistical regularity is often interpreted
by considering the observed units as samples from some very large or even
infinite parent population.

Consider first the case of a finite population, consisting of N individuals. For
any individual that comes under observation we note a certain characteristic $\xi$, and
we denote by E some specified event of the type $\xi \subset S$. The frequency of E in a
sample of n observed individuals tends, as the size of the sample increases, towards
the frequency of E in the total population, and actually reaches this value when we
take n = N, which means that we observe every individual in the whole population.
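
The case of a finite population is easily illustrated by drawing individuals one at a time without replacement. The sketch below rests on assumptions made only for the example: a hypothetical population of N = 1000 individuals, of which 300 show the event E (coded 1) and 700 do not (coded 0), and a pseudo-random order of observation.

```python
import random


def sample_frequencies(population, seed=0):
    """Frequency of the event (value 1) among the first n individuals observed,
    drawn without replacement, for n = 1..N."""
    rng = random.Random(seed)
    order = population[:]         # copy, so the caller's list is left untouched
    rng.shuffle(order)            # random order of observation
    freqs, count = [], 0
    for n, x in enumerate(order, start=1):
        count += x
        freqs.append(count / n)
    return freqs


population = [1] * 300 + [0] * 700     # hypothetical population, N = 1000
freqs = sample_frequencies(population)
# Frequencies after 10, 100 and all N = 1000 observations.
print(freqs[9], freqs[99], freqs[-1])
```

Whatever the order of observation, the last value is the population frequency itself, since for n = N every individual has been observed.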

The idea of an infinite parent population is a mathematical abstraction of the
same kind as the idea that a given random experiment might be repeated an infinite
number of times. We may consider this as a limiting case of a finite population,
when the number N of individuals increases indefinitely. The frequency of the event
E in a sample of n individuals from an infinite population will always be subject to
random fluctuations, as long as n is finite, but it may seem natural to assume that,
for indefinitely increasing values of n, this frequency would ultimately reach a »true»
value, corresponding to the frequency of E in the total infinite population.

This mode of interpretation by means of the idea of sampling may even be extended
to any type of random experiment. We may, in fact, conceive of any finite
sequence of repetitions of a random experiment as a sample from the hypothetical
infinite population of all experiments that might have been performed under the given
conditions. — We shall return to this matter in Ch. 25, where the idea of sampling
will be further discussed.




13.4. Object of a mathematical theory. — When, in some group
of observable phenomena, we find evidence of a confirmed regularity, 
we may try to form a mathematical theory of the subject. Such a 
theory may be regarded as a mathematical model of the body of 
empirical facts which constitute our data. 

We then choose as our starting point some of the most essential 
and most elementary features of the regularity observed in the data. 
These we express, in a simplified and idealized form, as mathematical 
propositions which are laid down as the basic axioms of our theory. 
From the axioms, various propositions are then obtained by purely 
logical deduction, without any further appeal to experience. The
logically consistent system of propositions built up in this way on an 
axiomatic basis constitutes our mathematical theory. 

Two classical examples of this procedure are provided by Geometry 
and Theoretical Mechanics. Geometry, e. g., is a system of purely 
mathematical propositions, designed to form a mathematical model of 
a large group of empirical facts connected with the position and configuration
in space of various bodies. It rests on a comparatively
small number of axioms, which are introduced without proof. Once 
the axioms have been chosen, the whole system of geometrical pro¬ 
positions is obtained from them by purely logical deductions. In the 
choice of the axioms we are guided by the regularities found in 
available empirical facts. The axioms may, however, be chosen in 
different ways, and accordingly there are several different systems of 
geometry: Euclidean, Lobatschewskian etc. Each of these is a logically
consistent system of mathematical propositions, founded on its
own set of axioms. — In a similar way, theoretical mechanics is a
system of mathematical propositions, designed to form a mathematical
model of observed facts connected with the equilibrium and motion
of bodies. 

Every proposition of such a system is true, in the mathematical
sense of the word, as soon as it is correctly deduced from the axioms.
On the other hand, it is important to emphasize that no proposition 
of any mathematical theory proves anything about the events that 
will, in fact, happen. The points, lines, planes etc. considered in pure 
geometry are not the perceptual things that we know from immediate 
experience. The pure theory belongs entirely to the conceptual 
sphere, and deals with abstract objects entirely defined by their properties,
as expressed by the axioms. For these objects, the propositions
of the theory are exactly and rigorously true. But no proposition
about such conceptual objects will ever involve a logical proof
of properties of the perceptual things of our experience. Mathematical
arguments are fundamentally incapable of proving physical
facts.

Thus the Euclidean proposition that the sum of the angles in a 
triangle is equal to π is rigorously true for a conceptual triangle as
defined in pure geometry. But it does not follow that the sum of the
angles measured in a concrete triangle will necessarily be equal to π,
just as it does not follow from the theorems of classical mechanics
that the sun and the planets will necessarily move in conformity with 
the Newtonian law of gravitation. These are questions that can only 
be decided by direct observation of the facts. 

Certain propositions of a mathematical theory may, however, be 
tested by experience. Thus the Euclidean proposition concerning the
sum of the angles in a triangle may be directly compared with actual 
measurements on concrete triangles. If, in systematic tests of this 
character, we find that the verifiable consequences of a theory really 
conform with sufficient accuracy to available empirical facts, we may 
feel more or less justified in thinking that there is some kind of re¬ 
semblance between the mathematical theory and the structure of the 
perceptual world. We further expect that the agreement between 
theory and experience will continue to hold also for future events and
for consequences of the theory not yet submitted to direct verification, 
and we allow our actions to be guided by this expectation. 

Such is the case, e. g., with respect to Euclidean geometry. When¬ 
ever a proposition belonging to this theory has been compared with 
empirical observations, it has been found that the agreement is suffi¬ 
cient for all ordinary practical purposes. (It is necessary to exclude here 
certain applications connected with the recent development of physics.) 
Thus, although it can never be logically proved that the sum of the 
angles in a concrete triangle must be equal to π, we regard it as
practically certain — i. e. sufficiently certain to act upon in practice — 
that our measurements will yield a sum approximately equal to this 
value. Moreover, we believe that the same kind of agreement will be 
found with respect to any proposition deduced from Euclidean axioms, 
that we may have occasion to test by experience. 

Naturally, our relying on the future agreement between theory and 
experience will grow more confident in the same measure as the 
accumulated evidence of such agreement increases. The »practical 
certainty» felt with respect to a proposition of Euclidean geometry
will be different from that connected with, say, the second law of
thermodynamics. Further, the closeness of the agreement that we 
may reasonably expect will not always be the same. Whereas in 
some cases the most sensitive instruments have failed to discover 
the slightest disagreement, there are other cases where a scientific 
»law» only accounts for the main features of the observed facts, the 
deviations being interpreted as »errors» or »disturbances».

In a case where we have found evidence of a more or less accurate 
and permanent agreement between theory and facts, the mathematical 
theory acquires a practical value, quite apart from its purely mathe¬ 
matical interest. The theory may then be used for various purposes. 
The majority of ordinary applications of a mathematical theory may 
be roughly classified under the three headings: Description, Analysis 
and Prediction. 

In the first place, the theory may be used for purely descriptive 
purposes. A large set of empirical data may, with the aid of the theory, 
be reduced to a relatively small number of characteristics which repre¬ 
sent, in a condensed form, the relevant information supplied by the 
data. Thus the complicated set of astronomical observations concerning 
the movements of the planets is summarized in a condensed form by 
the Copernican system. 

Further, the results of a theory may be applied as tools for a 
scientific analysis of the phenomena under observation. Almost every 
scientific investigation makes use of applications belonging to this 
class. The general principle behind such applications may be thus 
expressed : Any theory which does not fit the facts must be modified. Sup¬ 
pose, e. g., that we are trying to find out whether the variation of a 
certain factor has any influence on some phenomena in which we are 
interested. We may then try to work out a theory, according to which 
no such influence takes place, and compare the consequences of this 
theory with our observations. If on some point we find a manifest 
disagreement, this indicates that we should proceed to amend our 
theory in order to allow for the neglected influence. 

Finally, we may use the theory in order to predict the events that 
will happen under given circumstances. Thus, with the aid of geo¬ 
metrical and mechanical theory, an astronomer is able to predict the 
date of an eclipse. This constitutes a direct application of the prin¬ 
ciple mentioned above, that the agreement between theory and facts 
is expected to hold true also for future events. The same principle is 
applied when we use our theoretical knowledge with a view to produce
some determined event, as e. g. when a ballistic expert shows how to
direct a gun in order to hit the target. 

13.5. Mathematical probability. We now proceed to work out
a theory designed to serve as a mathematical model of phenomena 
showing statistical regularity. We want a theory which takes account 
of the fundamental facts characteristic of this mode of regularity, and 
which may be put to use in the various ways indicated in the pre¬ 
ceding paragraph. 

In laying the foundations of this theory, we shall try to imitate 
as strictly as possible the classical construction process described in 
the preceding paragraph. In the case of geometry, e. g., we know that 
by certain actions, such as the appropriate use of a ruler and a piece 
of chalk, we may produce things known in everyday language as 
points, straight lines etc. The empirical study of the properties of
these things gives evidence of certain regularities. We then postulate 
the existence of conceptual counterparts of the things: the points, 
straight lines etc. of pure geometry. Further, the fundamental fea¬ 
tures of the observed regularities are stated, in an idealized form, as 
the geometrical axioms. 

Similarly, in the case actually before us, we know that by certain 
actions, viz. the performance of sequences of certain experiments, we 
may produce sets of observed numbers known as frequency ratios. 
The empirical study of the behaviour of frequency ratios gives evidence 
of a certain typical form of regularity, as described in 13.3. Consider 
an event E connected with the random experiment $\mathfrak{E}$. According to
13.3, the frequency of E in a sequence of n repetitions of $\mathfrak{E}$ shows a
tendency to become constant as n increases, and we have been led to 
express the conjecture that for large n the frequency ratio would with 
practical certainty be approximately equal to some assignable num¬ 
ber P. 

In our mathematical theory, we shall accordingly introduce a definite
number P, which will be called the probability of the event E with respect
to the random experiment $\mathfrak{E}$.

Whenever we say that the probability of an event E with respect to
an experiment $\mathfrak{E}$ is equal to P, the concrete meaning of this assertion
will thus simply be the following: In a long series of repetitions of $\mathfrak{E}$,
it is practically certain that the frequency of E will be approximately
equal to P. 1) This statement will be referred to as the frequency
interpretation of the probability P.
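The frequency interpretation may be illustrated numerically. The following Python sketch is an editorial illustration and not part of the original exposition. It assumes a perfectly symmetrical die, simulated by a pseudo-random generator, so that the event E »the die shows 5 or 6» has P = 1/3. The function name `frequency_ratio` and the seed are hypothetical choices made for the example.

```python
import random


def frequency_ratio(n, seed=0):
    """Return nu/n for the event E = 'the die shows 5 or 6' in n simulated throws.

    Assumption: a symmetrical die, so that each face has probability 1/6
    and the event E has probability P = 1/3.
    """
    rng = random.Random(seed)  # fixed seed, so that the run is reproducible
    nu = sum(1 for _ in range(n) if rng.randint(1, 6) >= 5)
    return nu / n


if __name__ == "__main__":
    # The frequency ratio should settle near 1/3 as n increases.
    for n in (10, 100, 10_000, 100_000):
        print(n, frequency_ratio(n))
```

For small n the printed ratios may lie far from 1/3; the frequency interpretation only asserts approximate agreement for large n.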

The probability number P introduced in this way provides a conceptual
counterpart of the empirical frequency ratios. It will be observed
that, in order to define the probability P, both the type of random
experiment $\mathfrak{E}$ and the event E must be specified. Usually we shall,
however, regard the experiment $\mathfrak{E}$ as fixed, and we may then without
ambiguity simply talk of the probability of the event E.

For the further development of the theory, we shall have to consider
the fundamental properties of frequency ratios and express these,
in an idealized form, as statements concerning the properties of the
corresponding probability numbers. These statements, together with
the existence postulate for the probability numbers, will serve as the
axioms of our theory. In the present paragraph, we shall only add
a few preliminary remarks; the formal statement of the axioms will
then be given in the following chapter.

For any frequency ratio $\nu/n$ we obviously have $0 \le \nu/n \le 1$. Since,
by definition, any probability P is approximately equal to some frequency
ratio, it will be natural to assume that P satisfies the corresponding
inequality

$$0 \le P \le 1,$$

and this will in fact be one of the properties expressed by our axioms.

If E is an impossible event, i. e. an event that can never occur at
a performance of the experiment $\mathfrak{E}$, any frequency of E must be zero,
and consequently we take P = 0. On the other hand, if we know
that for some event E we have P = 0, then E is not necessarily an
impossible event. In fact, the frequency interpretation of P only
implies that the frequency $\nu/n$ of E will for large n be approximately
equal to zero, so that in the long run E will at most occur in a very
small percentage of all cases. The same conclusion holds not only when
P = 0, but even under the more general assumption that $0 \le P < \varepsilon$,
where $\varepsilon$ is some very small number. If E is an event of this type, and
if the experiment $\mathfrak{E}$ is performed one single time, it can thus be considered
as practically certain that E will not occur. This particular
case of the frequency interpretation of a probability will often be
applied in the sequel.

Similarly, if E is a certain event, i. e. an event that always occurs
at a performance of $\mathfrak{E}$, we take P = 1. On the other hand, if we
know that P = 1, we cannot infer that E is certain, but only that in
the long run E will occur in all but a very small percentage of cases.
The same conclusion holds under the more general assumption that
$1 - \varepsilon < P \le 1$, where $\varepsilon$ is some very small number. If E is an event
of this type, and if the experiment $\mathfrak{E}$ is performed one single time, it
can be considered as practically certain that E will occur.

1) At a later stage (cf 16.3), we shall be able to give a more precise form to this
statement.


With respect to the foundations of the theory of probability, many different
opinions are represented in the literature. None of these has so far met with
universal acceptance. We shall conclude this paragraph by a very brief survey of some
of the principal standpoints.

The theory of probability originated from the study of problems connected with 
ordinary games of chance (cf 13.3). In all these games, the results that are a priori 
possible may be arranged in a finite number of cases supposed to be perfectly
symmetrical, such as the cases represented by the six sides of a die, the 52 cards in an
ordinary pack of cards, etc. This fact seemed to provide a basis for a rational
explanation of the observed stability of frequency ratios, and the 18th century mathematicians
were thus led to the introduction of the famous principle of equally possible cases
which, after having been more or less tacitly assumed by earlier writers, was
explicitly framed by Laplace in his classical work (Ref. 22) as the fundamental
principle of the whole theory. According to this principle, a division in »equally
possible» cases is conceivable in any kind of observations, and the probability of an
event is the ratio between the number of cases favourable to the event, and the
total number of possible cases.

The weakness of this definition is obvious. In the first place, it does not tell us
how to decide whether two cases should be regarded as equally possible or not.
Moreover, it seems difficult, and to some minds even impossible, to form a precise
idea as to how a division in equally possible cases could be made with respect to
observations not belonging to the domain of games of chance. Much work has been
devoted to attempts to overcome these difficulties and introduce an improved form of
the classical definition.


On the other hand, many authors have tried to replace the classical definition
by something radically different. Modern work on this line has been largely
influenced by the general tendency to build any mathematical theory on an axiomatic
basis. Thus some authors try to introduce a system of axioms directly based on the
properties of frequency ratios. The chief exponent of this school is von Mises (Ref.
27, 28, 169), who defines the probability of an event as the limit of the frequency $\nu/n$ of
that event, as n tends to infinity. The existence of this limit, in a strictly
mathematical sense, is postulated as the first axiom of the theory. Though undoubtedly a
definition of this type seems at first sight very attractive, it involves certain
mathematical difficulties which deprive it of a good deal of its apparent simplicity.
Besides, the probability definition thus proposed would involve a mixture of empirical
and theoretical elements, which is usually avoided in modern axiomatic theories. It
would, e. g., be comparable to defining a geometrical point as the limit of a chalk
spot of infinitely decreasing dimensions, which is usually not done in modern
axiomatic geometry.

A further school chooses the same observational starting-point as the frequency
school, but avoids postulating the existence of definite limits of frequency ratios, and
introduces the probability of an event simply as a number associated with that event.
The axioms of the theory, which express the rules for operating with such numbers,
are idealized statements of observed properties of frequency ratios. The theory of
this school has been exposed from a purely mathematical point of view by
Kolmogoroff (Ref. 21). More or less similar standpoints are represented by Doob, Feller
and Neyman (Ref. 76, 84, 80). A work of the present author (Ref. 11) belongs to the
same order of ideas, and the present book constitutes an attempt to build the theory
of statistics on the same principles.

So far, we have throughout been concerned with the theory of probability,
conceived as a mathematical theory of phenomena showing statistical regularity.
According to this point of view, the probabilities have their counterparts in observable
frequency ratios, and any probability number assigned to a specified event must, in
principle, be liable to empirical verification. The differences between the various
schools mentioned above are mainly restricted to the foundations and the
mathematical exposition of the subject, whereas from the point of view of the applications the
various theories are largely equivalent.

In radical opposition to all the above approaches stands the more general
conception of probability theory as a theory of degrees of reasonable belief
represented e. g. by Keynes (Ref. 20) and Jeffreys (Ref. 18). According to this theory in its
most advanced form given by Jeffreys, any proposition has a numerically measurable
probability. Thus e. g. we should be able to express in definite numerical terms the
degree of »practical certainty» felt with respect to the future agreement between
some mathematical theory and observed facts (cf 13.4). Similarly there would be a
definite numerical probability of the truth of any statement such as: »The 'Masque
de Fer' was the brother of Louis XIV», »The present European war will end within
a year», or »There is organic life on the planet of Mars». Probabilities of this type
have no direct connection with random experiments, and thus no obvious frequency
interpretation. In the present book, we shall not attempt to discuss the question
whether such probabilities are numerically measurable and, if this question could be
answered in the affirmative, whether such measurement would serve any useful
purpose.

CHAPTER 14. 

Fundamental Definitions and Axioms. 

14.1. Random variables. (Axioms 1-2.) Consider a determined
random experiment $\mathfrak{E}$, which may be repeated a large number of times
under uniform conditions. We shall suppose that the result of each
particular experiment is given by a certain number of real quantities
$\xi_1, \xi_2, \ldots, \xi_k$, where $k \ge 1$.

We then introduce a corresponding variable point or vector $\boldsymbol{\xi} =
(\xi_1, \ldots, \xi_k)$ in the k-dimensional space $R_k$. We shall call $\boldsymbol{\xi}$ a
k-dimensional random variable. 1) Each performance of the experiment $\mathfrak{E}$
yields as its result an observed value of the variable $\boldsymbol{\xi}$, the coordinates
of which are the values of $\xi_1, \ldots, \xi_k$ observed on that particular
occasion.

Let S denote some simple set of points in $R_k$, say a k-dimensional
interval (cf 3.1), and let us consider the event $\boldsymbol{\xi} \subset S$, which may or
may not occur at any particular performance of $\mathfrak{E}$. We shall assume
that this event has a definite probability P, in the sense explained in
13.5. The number P will obviously depend on the set S, and will
accordingly be denoted by any of the expressions

$$P = P(S) = P(\boldsymbol{\xi} \subset S).$$

It is thus seen that the probability may be regarded as a set function,
and that it seems reasonable to require that this set function
should be uniquely defined at least for all k-dimensional intervals.
However, it would obviously not be convenient to restrict ourselves
to the consideration of intervals. We may also want to consider the
probabilities of events that correspond e. g. to sets obtained from
intervals by means of the operations of addition, subtraction and
multiplication (cf 1.3). We have seen in 2.3 and 3.3 that, by such
operations, we are led to the class of Borel sets in $R_k$ as a natural
extension of the class of all intervals. It thus seems reasonable
directly to extend our considerations to this class, and assume that
P(S) is defined for any Borel set. It is true that when S is some
Borel set of complicated structure, the event $\boldsymbol{\xi} \subset S$ may not be directly
observable, and the introduction of probabilities of events of this
type must be regarded as a theoretical idealization. Some of the
consequences of the theory will, however, always be directly observable,
and the practical value of the theory will have to be judged from the
agreement between its observable consequences and empirical facts.
We may thus state our first axiom:

Axiom 1. To any random variable $\boldsymbol{\xi}$ in $R_k$ there corresponds a
set function P(S) uniquely defined for all Borel sets S in $R_k$, such that
P(S) represents the probability of the event (or relation) $\boldsymbol{\xi} \subset S$.

1) Throughout the exposition of the general theory, random variables will
preferably be denoted by the letters $\xi$ and $\eta$. We use heavy-faced types for
multi-dimensional variables (k > 1), and ordinary types for one-dimensional variables.

As we have seen in 13.5, it will be natural to assume that any
probability P satisfies the inequality $0 \le P \le 1$. Further, at any
performance of the experiment $\mathfrak{E}$, the observed value of $\boldsymbol{\xi}$ must lie
somewhere in $R_k$, so that the event $\boldsymbol{\xi} \subset R_k$ is a certain event, and in
accordance with 13.5 we then take $P(R_k) = 1$.

Let now $S_1$ and $S_2$ be two sets in $R_k$ without a common point. 1)
Consider a sequence of n repetitions of $\mathfrak{E}$, and let

$\nu_1$ denote the number of occurrences of the event $\boldsymbol{\xi} \subset S_1$,
$\nu_2$ denote the number of occurrences of the event $\boldsymbol{\xi} \subset S_2$,
$\nu$ denote the number of occurrences of the event $\boldsymbol{\xi} \subset S_1 + S_2$.

We then obviously have $\nu = \nu_1 + \nu_2$, and hence the corresponding
frequency ratios satisfy the relation

$$\frac{\nu}{n} = \frac{\nu_1}{n} + \frac{\nu_2}{n}.$$

For large values of n it is, by assumption, practically certain that
the frequencies $\nu/n$, $\nu_1/n$ and $\nu_2/n$ are approximately equal to $P(S_1 + S_2)$,
$P(S_1)$ and $P(S_2)$ respectively. It thus seems reasonable to require that
the probability P should possess the additive property

$$P(S_1 + S_2) = P(S_1) + P(S_2).$$

The argument extends itself immediately to any finite number of sets.
In order to obtain a simple and coherent mathematical theory we shall,
however, now introduce a further idealization. We shall, in fact,
assume that the additive property of P(S) may be extended even to
an enumerable sequence of sets $S_1, S_2, \ldots$, no two of which have a
common point, so that we have $P(S_1 + S_2 + \cdots) = P(S_1) + P(S_2) + \cdots$.
(As in the case of Axiom 1 this implies, of course, the introduction
of relations that are not directly observable.) Using the terminology
introduced in 6.2 and 8.2, we may now state our second axiom:

Axiom 2. The function P(S) is a non-negative and additive set
function in $R_k$ such that $P(R_k) = 1$.
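The empirical fact behind the additive property can be checked directly on simulated data. The following Python sketch is an editorial illustration under stated assumptions: the experiment is a throw with a die simulated by a pseudo-random generator, and the sets `S1` and `S2` as well as the helper `count_occurrences` are hypothetical names chosen for the example.

```python
import random


def count_occurrences(outcomes, S):
    """Number of observed values falling in the set S."""
    return sum(1 for x in outcomes if x in S)


rng = random.Random(1)  # fixed seed, so that the run is reproducible
n = 10_000
outcomes = [rng.randint(1, 6) for _ in range(n)]  # n simulated throws of a die

S1, S2 = {1, 2}, {6}  # two sets without a common point
nu1 = count_occurrences(outcomes, S1)
nu2 = count_occurrences(outcomes, S2)
nu = count_occurrences(outcomes, S1 | S2)  # the sum S1 + S2

# The counts, and hence the frequency ratios, are exactly additive.
assert nu == nu1 + nu2
print(nu / n, nu1 / n, nu2 / n)
```

The additivity of the counts holds for every sequence of observations, whatever the die; it is only the passage to an enumerable sequence of sets that constitutes an idealization.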

According to 6.6 and 8.4, any set function P(S) with the properties
stated in Axiom 2 defines a distribution in $R_k$, that may be concretely
interpreted by means of a distribution of a mass unit over
the space $R_k$, such that any set S carries the mass P(S). This distribution
will be called the probability distribution of the random
variable $\boldsymbol{\xi}$, and the set function P(S) will be called the probability
function (abbreviated pr. f.) of $\boldsymbol{\xi}$. Similarly, the point function $F(\boldsymbol{x}) =
F(x_1, \ldots, x_k)$ corresponding to P(S), which is defined by (6.6.1) in the
case k = 1, and by (8.4.1) in the general case, will be called the distribution
function (abbreviated d. f.) of $\boldsymbol{\xi}$. As shown in 6.6 and 8.4,
the distribution may be uniquely defined either by the set function
P(S) or by the point function $F(\boldsymbol{x})$.

1) As already stated in 6.1, we only consider Borel sets.

Finally, we observe that the Axioms 1 and 2 may be summed up
in the following statement: Any random variable has a unique
probability distribution.


If, e. g., the experiment $\mathfrak{E}$ consists in making a throw with a die, and observing
the number of points obtained, the corresponding random variable $\xi$ is a number
that may assume the values 1, 2, ..., 6, and these values only. Our axioms then
assert the existence of a distribution in $R_1$ with certain masses $p_1, p_2, \ldots, p_6$ placed
in the points 1, 2, ..., 6, such that $p_r$ represents the probability of the event $\xi = r$,
while $\sum_{r=1}^{6} p_r = 1$. On the other hand, it is important to observe that it does not follow
from the axioms that $p_r = \frac{1}{6}$ for every r. The numbers $p_r$ should, in fact, be regarded
as physical constants of the particular die that we are using, and the question as to
their numerical values cannot be answered by the axioms of probability theory, any
more than the size and the weight of the die are determined by the geometrical and
mechanical axioms. However, experience shows that in a well made die the frequency
of any event $\xi = r$ in a long series of throws usually approaches $\frac{1}{6}$, and accordingly
we shall often assume that all the $p_r$ are equal to $\frac{1}{6}$, when the example of the die
is used for purposes of illustration. This is, however, an assumption and not a logical
consequence of the axioms.
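The distinction between the axioms and the assumption of symmetry can be made concrete. In the following Python sketch, an editorial illustration, a distribution on the points 1, ..., 6 is represented as a dictionary of masses. The symmetrical die and the »loaded» die are both hypothetical examples with masses chosen for illustration; both satisfy Axioms 1 and 2, and the function name `probability` is likewise an assumption.

```python
from fractions import Fraction


def probability(masses, S):
    """P(S): the total mass carried by the set S of points."""
    return sum((p for x, p in masses.items() if x in S), Fraction(0))


# Assumed symmetrical die: p_r = 1/6 for every r (an assumption, not an axiom).
fair = {r: Fraction(1, 6) for r in range(1, 7)}

# A hypothetical loaded die; it satisfies the axioms just as well.
loaded = {r: Fraction(1, 10) for r in range(1, 6)}
loaded[6] = Fraction(1, 2)

whole_space = set(range(1, 7))
for die in (fair, loaded):
    assert all(p >= 0 for p in die.values())       # non-negativity
    assert probability(die, whole_space) == 1      # total mass equal to 1
    S1, S2 = {1, 2}, {5, 6}                        # sets without a common point
    assert probability(die, S1 | S2) == probability(die, S1) + probability(die, S2)

print(probability(fair, {2, 4, 6}), probability(loaded, {2, 4, 6}))
```

Exact rational arithmetic is used so that the additive property can be verified by equality rather than approximately.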

If, on the other hand, $\mathfrak{E}$ consists in observing the stature $\xi$ of a man belonging
to some given group, $\xi$ may assume any value within a certain part of the scale,
and our axioms now assert the existence of a non-negative and additive set function
P(S) in $R_1$, such that P(S) represents the probability that $\xi$ takes a value belonging
to the set S.

The Axioms 1 and 2 are, for the class of random variables here considered,
equivalent to the axioms given by Kolmogoroff (Ref. 21). The axioms of
Kolmogoroff are, however, applicable to random variables defined in spaces of a more general
character than those here considered. The same axioms as above were used in a work
of the present author (Ref. 11).


14.2. Combined variables. (Axiom 3.) We shall first consider a
particular case. Let the random experiments $\mathfrak{E}$ and $\mathfrak{F}$ be connected
with the one-dimensional random variables $\xi$ and $\eta$ respectively. Thus
the result of $\mathfrak{E}$ is represented by one single quantity $\xi$, while the
result of $\mathfrak{F}$ is another quantity $\eta$. It often occurs that we have occasion
to consider a combined experiment $(\mathfrak{E}, \mathfrak{F})$ which consists in
making, in accordance with some given rule, one performance of each
of the experiments $\mathfrak{E}$ and $\mathfrak{F}$, and observing jointly the results of both.

This means that we are observing a variable point $(\xi, \eta)$, the coordinates
of which are the results $\xi$ and $\eta$ of the experiments $\mathfrak{E}$ and
$\mathfrak{F}$. We may then consider the point $(\xi, \eta)$ as representing a two-dimensional
variable, that will be called a combined variable defined by $\xi$
and $\eta$. The space of the combined variable is the two-dimensional
product space (cf 3.5) of the one-dimensional spaces of $\xi$ and $\eta$.

Let the experiment $\mathfrak{E}$ consist in a throw with a certain die, while $\mathfrak{F}$ consists in
a throw with another die, and the combined experiment $(\mathfrak{E}, \mathfrak{F})$ consists in a throw
with both dice. The result of $\mathfrak{E}$ is a number $\xi$ that may assume the values 1, 2, ..., 6,
and the same holds for the result $\eta$ of $\mathfrak{F}$. The combined variable $(\xi, \eta)$ then
expresses the joint results for both dice, and its possible »values» are the 36 pairs of
numbers (1, 1), ..., (6, 6).

If, on the other hand, the experiment $\mathfrak{E}$ consists in observing the stature $\xi$ of a
married man, while $\mathfrak{F}$ consists in observing the stature $\eta$ of a married woman, the
combined experiment $(\mathfrak{E}, \mathfrak{F})$ may consist e. g. in observing both statures $(\xi, \eta)$ of a
married couple. The point $(\xi, \eta)$ may in this case assume any position within a
certain part of the plane.

The principle of combination of variables may be applied to more 
general cases. Let the random experiments $\mathfrak{E}_1, \ldots, \mathfrak{E}_n$ be connected
with the random variables $\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_n$ of $k_1, \ldots, k_n$ dimensions
respectively, and consider a combined experiment $(\mathfrak{E}_1, \ldots, \mathfrak{E}_n)$ which
consists in making one performance of each $\mathfrak{E}_i$, and observing jointly
all the results. We then obtain a combined variable $(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_n)$
represented by a point in the $(k_1 + \cdots + k_n)$-dimensional product space
(cf 3.5) of the spaces of all the $\boldsymbol{\xi}_i$.

The empirical study of frequency ratios connected with combined 
experiments discloses a statistical regularity of the same kind as in 
the case of the component experiments. Any experiment composed of 
random experiments shows, in fact, the character of a random experi¬ 
ment, and we may accordingly state our third axiom : 

Axiom 3. If $\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_n$ are random variables, any combined
variable $(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_n)$ is also a random variable.

It then follows from the preceding axioms that any combined
variable has a unique probability distribution in its space of $k_1 + \cdots + k_n$
dimensions. This distribution will often be called the joint or simultaneous
distribution of the variables $\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_n$.

Consider now the case of two random variables $\boldsymbol{\xi}$ and $\boldsymbol{\eta}$ of $k_1$ and
$k_2$ dimensions respectively. Let $P_1$ and $P_2$ denote the pr. f:s of $\boldsymbol{\xi}$ and
$\boldsymbol{\eta}$, while P denotes the pr. f. of the combined variable $(\boldsymbol{\xi}, \boldsymbol{\eta})$. If S
denotes a set in the space of the variable $\boldsymbol{\xi}$, the expression $P(\boldsymbol{\xi} \subset S)$
represents the probability that the combined variable $(\boldsymbol{\xi}, \boldsymbol{\eta})$ takes a
value belonging to the cylinder set (cf 3.5) defined by the relation $\boldsymbol{\xi} \subset S$,
or in other words the probability that $\boldsymbol{\xi}$ takes a value belonging to
S, irrespective of the value of $\boldsymbol{\eta}$. Similarly, if T is a set in the space
of $\boldsymbol{\eta}$, the expression $P(\boldsymbol{\eta} \subset T)$ represents the probability that $\boldsymbol{\eta}$ takes
a value belonging to T, irrespective of the value of $\boldsymbol{\xi}$. We thus have

$$P(\boldsymbol{\xi} \subset S) = P_1(S), \qquad P(\boldsymbol{\eta} \subset T) = P_2(T), \tag{14.2.1}$$

and according to (8.4.2) this shows that the marginal distributions of
the $(k_1 + k_2)$-dimensional combined distribution, relative to the subspaces
of the variables $\boldsymbol{\xi}$ and $\boldsymbol{\eta}$, are identical with the distributions
of $\boldsymbol{\xi}$ and $\boldsymbol{\eta}$ respectively. Obviously this may be generalized to any
number of component variables. When the mass in the combined
distribution is projected on the subspace of any of the component
variables, the marginal distribution thus obtained will always be identical
with the distribution of the corresponding variable.
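The projection of mass described above is easily carried out for a discrete distribution. The following Python sketch is an editorial illustration: the joint distribution `joint` of two one-dimensional variables is a hypothetical table of masses chosen for the example, and the names `marginal` and `cylinder_probability` are likewise assumptions.

```python
from fractions import Fraction

# Hypothetical joint distribution of (xi, eta), each variable taking the
# values 0 and 1; the masses are arbitrary illustrative numbers with sum 1.
joint = {
    (0, 0): Fraction(2, 5), (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 5), (1, 1): Fraction(3, 10),
}


def marginal(joint, axis):
    """Project the mass of the joint distribution on one coordinate subspace."""
    masses = {}
    for point, mass in joint.items():
        masses[point[axis]] = masses.get(point[axis], Fraction(0)) + mass
    return masses


def cylinder_probability(joint, S):
    """P(xi in S): mass of the cylinder set, irrespective of the value of eta."""
    return sum((m for (x, _), m in joint.items() if x in S), Fraction(0))


P1 = marginal(joint, 0)  # distribution of xi
P2 = marginal(joint, 1)  # distribution of eta

# Relation (14.2.1) for S = {1}: the cylinder set carries the marginal mass.
assert cylinder_probability(joint, {1}) == P1[1]
print(P1, P2)
```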


An important case of combination of variables arises when we consider a sequence
of repetitions of a random experiment $\mathfrak{E}$. Let us form a combined experiment by
performing n times the same experiment $\mathfrak{E}$, and observing all the results $\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_n$ of
the n repetitions. The result of this combined experiment will then be an observed
value of the combined variable $(\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_n)$, which expresses the joint results
of all the n repetitions of $\mathfrak{E}$.

If, e. g., $\mathfrak{E}$ consists in a throw with a die, the corresponding one-dimensional
random variable $\xi$ has the six possible values 1, 2, ..., 6. The combined variable
$(\xi_1, \ldots, \xi_n)$ then expresses the joint results of n successive throws, and its »values»
are the $6^n$ systems of n numbers (1, ..., 1), ..., (6, ..., 6). According to Axiom 3,
there exists a corresponding probability distribution in $R_n$, with determined
probabilities $p_{1, \ldots, 1}, \ldots, p_{6, \ldots, 6}$ corresponding to the various possible values of the
combined variable.
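The possible values of the combined variable may be enumerated explicitly for small n. The following Python sketch is an editorial illustration; the function name `combined_values` is a hypothetical choice. It only lists the values, since the probabilities attached to them are not determined by the axioms.

```python
from itertools import product


def combined_values(n):
    """All possible values of (xi_1, ..., xi_n) for n throws of a die."""
    return list(product(range(1, 7), repeat=n))


values = combined_values(3)
print(len(values))            # 6**3 systems of 3 numbers
print(values[0], values[-1])  # the first and the last system
```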


In problems where several random variables are considered simul¬ 
taneously, we shall always assume that a rule of combination is given 
for all the variables that enter into the question, so that the combined
variable is defined. We shall then as a rule use the symbol P(S) to 
denote the pr. f. of the combined variable. 


14.3. Conditional distributions. Let $\boldsymbol{\xi}$ and $\boldsymbol{\eta}$ be random variables
of $k_1$ and $k_2$ dimensions, attached to the random experiments $\mathfrak{E}$
and $\mathfrak{F}$. Let P denote the pr. f. of the combined variable $(\boldsymbol{\xi}, \boldsymbol{\eta})$, while
S and T are sets in the spaces of $\boldsymbol{\xi}$ and $\boldsymbol{\eta}$ respectively. The expression
$P(\boldsymbol{\xi} \subset S, \boldsymbol{\eta} \subset T)$ then represents the probability of the event
defined by the joint relations $\boldsymbol{\xi} \subset S$, $\boldsymbol{\eta} \subset T$, or, in other words, the
probability that the combined variable $(\boldsymbol{\xi}, \boldsymbol{\eta})$ takes a value belonging
to the rectangle set (cf 3.5) with the sides S and T.

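Suppose now that $P(\boldsymbol{\xi} \subset S) > 0$. We then introduce a new quantity
$P(\boldsymbol{\eta} \subset T \mid \boldsymbol{\xi} \subset S)$ defined by the relation

$$P(\boldsymbol{\eta} \subset T \mid \boldsymbol{\xi} \subset S) = \frac{P(\boldsymbol{\xi} \subset S, \boldsymbol{\eta} \subset T)}{P(\boldsymbol{\xi} \subset S)}. \tag{14.3.1}$$

Similarly, supposing that $P(\boldsymbol{\eta} \subset T) > 0$, we introduce another new
quantity $P(\boldsymbol{\xi} \subset S \mid \boldsymbol{\eta} \subset T)$ by writing

$$P(\boldsymbol{\xi} \subset S \mid \boldsymbol{\eta} \subset T) = \frac{P(\boldsymbol{\xi} \subset S, \boldsymbol{\eta} \subset T)}{P(\boldsymbol{\eta} \subset T)}. \tag{14.3.2}$$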


In order to justify the names that will presently be given to these 
quantities, we shall now deduce some important properties of the 
latter. 

In the first place, let us in (14.3.2) consider T as a fixed set, while
S is variable in the space of the variable $\boldsymbol{\xi}$. The second member
of (14.3.2) then becomes a non-negative and additive function of the
set S. When $S = R_{k_1}$, the rectangle set $\boldsymbol{\xi} \subset S$, $\boldsymbol{\eta} \subset T$ is identical
with the cylinder set (cf 3.5) $\boldsymbol{\eta} \subset T$, so that the second member of
(14.3.2) then assumes the value 1. Thus $P(\boldsymbol{\xi} \subset S \mid \boldsymbol{\eta} \subset T)$ is, for fixed
T, a non-negative and additive function of the set S which for
$S = R_{k_1}$ assumes the value 1. In other words, $P(\boldsymbol{\xi} \subset S \mid \boldsymbol{\eta} \subset T)$ is, for
fixed T, the probability function of a certain distribution in $R_{k_1}$. In the
same way it is shown that $P(\boldsymbol{\eta} \subset T \mid \boldsymbol{\xi} \subset S)$ is, for fixed S, the pr. f. of a
certain distribution in the space $R_{k_2}$ of the variable $\boldsymbol{\eta}$. We shall
now show that, in a certain generalized sense, these quantities may in
fact be regarded as probabilities having a determined frequency
interpretation.

Consider a sequence Z of n repetitions of the combined experiment
$(\mathfrak{E}, \mathfrak{F})$. Each of the n experiments which are the elements of Z
yields as its result an observed »value» of the combined variable
$(\boldsymbol{\xi}, \boldsymbol{\eta})$. In the sequence Z, let

$\nu_1$ denote the number of occurrences of the event $\boldsymbol{\xi} \subset S$,
$\nu_2$ denote the number of occurrences of the event $\boldsymbol{\eta} \subset T$,
$\nu$ denote the number of occurrences of the event $\boldsymbol{\xi} \subset S$, $\boldsymbol{\eta} \subset T$,

while $Z_1$, $Z_2$ and $Z_3$ are the corresponding sub-sequences of Z.
Obviously the third event occurs when and only when the first and
second events both occur, so that $Z_3$ consists precisely of the elements
common to $Z_1$ and $Z_2$.

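According to the frequency interpretation of a probability (cf 13.5),
it is practically certain that the relations

$$P(\boldsymbol{\xi} \subset S) = \frac{\nu_1}{n}, \qquad P(\boldsymbol{\eta} \subset T) = \frac{\nu_2}{n}, \qquad P(\boldsymbol{\xi} \subset S, \boldsymbol{\eta} \subset T) = \frac{\nu}{n}$$

will, for large n, be approximately satisfied. By (14.3.1) and (14.3.2)
we then have, approximately,

$$P(\boldsymbol{\eta} \subset T \mid \boldsymbol{\xi} \subset S) = \frac{\nu}{\nu_1}, \qquad P(\boldsymbol{\xi} \subset S \mid \boldsymbol{\eta} \subset T) = \frac{\nu}{\nu_2}. \tag{14.3.3}$$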

Consider now the $\nu_1$ elements of the sub-sequence $Z_1$. These are all
cases among our n repetitions, where the event $\boldsymbol{\xi} \subset S$ has occurred.
Among these, there are exactly $\nu$ cases where, in addition, the event
$\boldsymbol{\eta} \subset T$ has occurred, viz. the $\nu$ cases forming the sub-sequence $Z_3$.
Thus the ratio $\nu/\nu_1$ is the frequency of the event $\boldsymbol{\eta} \subset T$ in the
sub-sequence $Z_1$ or, as we may express it, $\nu/\nu_1$ is the conditional frequency
of the event $\boldsymbol{\eta} \subset T$, relative to the hypothesis $\boldsymbol{\xi} \subset S$. The corresponding
property of the ratio $\nu/\nu_2$ is obtained by simple permutation. The
approximate relations (14.3.3) now provide a frequency interpretation of
the expressions $P(\boldsymbol{\eta} \subset T \mid \boldsymbol{\xi} \subset S)$ and $P(\boldsymbol{\xi} \subset S \mid \boldsymbol{\eta} \subset T)$, which will justify
the introduction of the following definitions:

The quantity $P(\boldsymbol{\eta} \subset T \mid \boldsymbol{\xi} \subset S)$ defined by (14.3.1) will be called the
conditional probability of the event $\boldsymbol{\eta} \subset T$, relative to the hypothesis $\boldsymbol{\xi} \subset S$.
Accordingly, the distribution in $R_{k_2}$ defined by (14.3.1) for fixed S will be
called the conditional distribution of $\boldsymbol{\eta}$, relative to the hypothesis $\boldsymbol{\xi} \subset S$.
With respect to the quantity $P(\boldsymbol{\xi} \subset S \mid \boldsymbol{\eta} \subset T)$ defined by (14.3.2), we
shall use the denominations obtained by permutation of symbols.

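It should be well observed that each conditional probability is hereby
defined only in the case when the probability of the corresponding
hypothesis is different from zero.

When $P(\boldsymbol{\xi} \subset S)$ and $P(\boldsymbol{\eta} \subset T)$ are both different from zero, we
obtain from (14.3.1) and (14.3.2) the relation

$$P(\boldsymbol{\xi} \subset S, \boldsymbol{\eta} \subset T) = P(\boldsymbol{\xi} \subset S)\,P(\boldsymbol{\eta} \subset T \mid \boldsymbol{\xi} \subset S) = P(\boldsymbol{\eta} \subset T)\,P(\boldsymbol{\xi} \subset S \mid \boldsymbol{\eta} \subset T). \tag{14.3.4}$$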
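The definitions (14.3.1) and (14.3.2) and the relation (14.3.4) can be verified on a small discrete distribution. The following Python sketch is an editorial illustration: the joint distribution `joint` is a hypothetical table of masses, and the function names are assumptions made for the example. A conditional probability is left undefined, here by raising an error, when the probability of the hypothesis is zero.

```python
from fractions import Fraction

# Hypothetical joint distribution of (xi, eta); illustrative masses only.
joint = {
    (0, 0): Fraction(2, 5), (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 5), (1, 1): Fraction(3, 10),
}
SPACE = {0, 1}  # the whole space of either variable


def prob(S, T):
    """P(xi in S, eta in T): mass of the rectangle set with the sides S and T."""
    return sum((m for (x, y), m in joint.items() if x in S and y in T), Fraction(0))


def cond_eta_given_xi(T, S):
    """P(eta in T | xi in S) as in (14.3.1); undefined when P(xi in S) = 0."""
    p_S = prob(S, SPACE)
    if p_S == 0:
        raise ValueError("conditional probability undefined: P(xi in S) = 0")
    return prob(S, T) / p_S


def cond_xi_given_eta(S, T):
    """P(xi in S | eta in T) as in (14.3.2); undefined when P(eta in T) = 0."""
    p_T = prob(SPACE, T)
    if p_T == 0:
        raise ValueError("conditional probability undefined: P(eta in T) = 0")
    return prob(S, T) / p_T


S, T = {1}, {1}
# Relation (14.3.4): both factorizations give the mass of the rectangle set.
assert prob(S, T) == prob(S, SPACE) * cond_eta_given_xi(T, S)
assert prob(S, T) == prob(SPACE, T) * cond_xi_given_eta(S, T)
print(cond_eta_given_xi(T, S), cond_xi_given_eta(S, T))
```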

In the example considered in the preceding paragraph, where $\xi$ is the stature of
a married man, and $\eta$ the stature of his wife, the data corresponding to all observed
values of $\xi$ determine the distribution of $\xi$. Thus e. g. the probability of the relation
$a < \xi \le b$ will be approximately determined by the frequency of the corresponding
event in the totality of our data.

Suppose now that we select from our data the subgroup of all cases where $\eta$ is
larger than some given constant c. The data corresponding to the values of $\xi$ in the
cases belonging to this subgroup determine the conditional distribution of $\xi$, relative
to the hypothesis $\eta > c$. Thus e. g. the frequency of the event $a < \xi \le b$ within the
subgroup is a conditional frequency as defined above, and for a large number of
observations this becomes, with practical certainty, approximately equal to the
conditional probability of the relation $a < \xi \le b$, relative to the hypothesis $\eta > c$. Here
the set S is the interval $a < \xi \le b$, while the set T is the interval $\eta > c$.

It is evident that, in this case, we have reason to suppose that the conditional
probability will differ from the probability in the totality of the data, since the taller
women corresponding to the hypothesis $\eta > c$ may on the average be expected to
choose, or be chosen by, taller husbands than the shorter women.

On the other hand, let $\xi$ still stand for the stature of a married man, while $\eta$
denotes the stature of the wife belonging to the couple immediately following $\xi$ in the
population register from which our data are taken. In this case, there will be no
obvious reason to expect the conditional probability of the relation $a < \xi \le b$, relative
to the hypothesis $\eta > c$, to be different from the unconditional probability $P(a < \xi \le b)$.
On the contrary, we should expect the conditional distribution of $\xi$ to be independent
of any hypothesis made with respect to $\eta$, and conversely. If this condition is satisfied,
we are concerned with the case of independent variables, that will be discussed in the
following paragraph.

14.4. Independent variables. — An important particular case of 
the concepts introduced in the preceding paragraph arises when the 
multiplicative relation 

(14.4.1)    $P(\xi \subset S,\ \eta \subset T) = P(\xi \subset S)\,P(\eta \subset T)$

is satisfied for any sets S and T. The relations (14.3.1) and (14.3.2) 
show that this implies 

(14.4.2)    $P(\xi \subset S \mid \eta \subset T) = P(\xi \subset S)$   if   $P(\eta \subset T) > 0$,

(14.4.3)    $P(\eta \subset T \mid \xi \subset S) = P(\eta \subset T)$   if   $P(\xi \subset S) > 0$,

so that the conditional distribution of $\xi$ is independent of any hypothesis
made with respect to $\eta$, and conversely.

In this case we shall say that $\xi$ and $\eta$ are independent random
variables, and that the events $\xi \subset S$ and $\eta \subset T$ are independent events.

Conversely, suppose that one of the two last relations, say (14.4.2),
is satisfied for all sets $S$ and $T$ such that the conditional probability
on the left-hand side is defined, i. e. for $P(\eta \subset T) > 0$. It then follows
from (14.3.2) that the multiplicative relation (14.4.1) holds in all
these cases. (14.4.1) is, however, trivial in the case $P(\eta \subset T) = 0$,
since both members are then equal to zero. Thus (14.4.1) holds for
all $S$ and $T$, and hence we infer (14.4.3). Thus either relation (14.4.2)
or (14.4.3) constitutes a necessary and sufficient condition of independence.
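For a distribution concentrated in finitely many points, the multiplicative relation (14.4.1) and the conditional form (14.4.2) can be checked exhaustively. The following is a minimal Python sketch under that assumption; the dictionary layout and the names `prob`, `is_independent`, `conditional`, `product_case` and `dependent_case` are illustrative choices, not notation from the text.

```python
from fractions import Fraction
from itertools import chain, combinations

def subsets(values):
    """Every subset of a finite collection of values, as tuples."""
    vals = list(values)
    return chain.from_iterable(combinations(vals, r) for r in range(len(vals) + 1))

def prob(joint, S, T):
    """P(xi in S, eta in T) for a joint distribution {(x, y): mass}."""
    return sum(p for (x, y), p in joint.items() if x in S and y in T)

def is_independent(joint):
    """True when (14.4.1) holds for all sets S and T of attainable values."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    return all(
        prob(joint, S, T) == prob(joint, S, ys) * prob(joint, xs, T)
        for S in subsets(xs)
        for T in subsets(ys)
    )

def conditional(joint, S, T):
    """P(xi in S | eta in T) as in (14.4.2); requires P(eta in T) > 0."""
    xs = {x for x, _ in joint}
    return prob(joint, S, T) / prob(joint, xs, T)

# Hypothetical data: a product distribution and a fully dependent one.
product_case = {(x, y): Fraction(1, 2) * Fraction(1, 3)
                for x in (0, 1) for y in (0, 1, 2)}
dependent_case = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}

print(is_independent(product_case))    # multiplicative relation holds
print(is_independent(dependent_case))  # multiplicative relation fails
```

Exact rational arithmetic is used so that the equality test in (14.4.1) is not disturbed by rounding.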

We shall now give another necessary and sufficient condition. Let
$P_1$ and $P_2$ denote the probability functions of $\xi$ and $\eta$, while the
distribution functions of $\xi$, $\eta$ and $(\xi, \eta)$ are

$F_1(x) = F_1(x_1, \ldots, x_{k_1}) = P_1(\xi_1 \le x_1, \ldots, \xi_{k_1} \le x_{k_1})$,

$F_2(y) = F_2(y_1, \ldots, y_{k_2}) = P_2(\eta_1 \le y_1, \ldots, \eta_{k_2} \le y_{k_2})$,

$F(x, y) = F(x_1, \ldots, x_{k_1}, y_1, \ldots, y_{k_2}) = P(\xi_i \le x_i,\ \eta_j \le y_j)$

for all $i = 1, 2, \ldots, k_1$ and $j = 1, 2, \ldots, k_2$. According to (14.2.1), the
multiplicative relation (14.4.1) may be written

(14.4.4)    $P(\xi \subset S,\ \eta \subset T) = P_1(S)\,P_2(T)$.

Now it has been shown in 8.6 that, when $P_1$ and $P_2$ are given pr.
f:s in the spaces of $\xi$ and $\eta$, there is one and only one distribution
in the product space satisfying (14.4.4), viz. the distribution defined
by the d. f.

(14.4.5)    $F(x, y) = F_1(x)\,F_2(y)$.

Thus (14.4.5) is a necessary and sufficient condition for the independence
of the variables $\xi$ and $\eta$.

Consider now the case of $n$ random variables $\xi_1, \ldots, \xi_n$, with pr.
f:s $P_1, \ldots, P_n$ and d. f:s $F_1, \ldots, F_n$. Let $P$ and $F$ denote the pr. f.
and the d. f. of the combined variable $(\xi_1, \ldots, \xi_n)$. In direct
generalization of the above, we shall say that $\xi_1, \ldots, \xi_n$ are independent
random variables, if the multiplicative relation

(14.4.6)    $P(\xi_1 \subset S_1, \ldots, \xi_n \subset S_n) = \prod_{r=1}^{n} P(\xi_r \subset S_r) = \prod_{r=1}^{n} P_r(S_r)$

is satisfied for any sets $S_1, \ldots, S_n$. Using the final remark of 8.6, we
find that the condition (14.4.5) may be directly generalized, so that
in the present case the relation $F = F_1 \cdots F_n$ is a necessary and
sufficient condition of independence. If $\xi_r$ and the combined variable
$(\xi_1, \ldots, \xi_{r-1})$ are independent for $r = 2, 3, \ldots, n$, then $\xi_1, \ldots, \xi_n$ are
independent. This follows directly from the independence definition (14.4.6).

If, in a sequence $\xi_1, \xi_2, \ldots$, any group $\xi_1, \ldots, \xi_n$ of variables
are independent, we shall briefly say that $\xi_1, \xi_2, \ldots$ form a sequence
of independent variables. An important case of a sequence of this
type arises when we consider a sequence of repetitions of a random
experiment $\mathfrak{E}$. If the conditions of the successive experiments are
strictly uniform, the probability $P$ of any specified event connected
with, say, the $n$:th experiment cannot be supposed to be in any way
influenced by the results of the $n - 1$ preceding experiments. This
implies, however, that the distribution of the random variable $\xi_n$
connected with the $n$:th experiment is independent of any hypothesis made
with respect to the value assumed by the combined variable $(\xi_1, \ldots, \xi_{n-1})$,
so that $\xi_n$ and $(\xi_1, \ldots, \xi_{n-1})$ are independent. According to the above,
it then follows that $\xi_1, \xi_2, \ldots$ form a sequence of independent
variables. A sequence of repetitions of a random experiment $\mathfrak{E}$ showing
a uniformity of this character will be briefly denoted as a sequence
of independent repetitions of $\mathfrak{E}$. When nothing is said to the contrary,
we shall always assume that any sequence of repetitions that we may
consider is of this type.


Consider a combined experiment consisting of two throws with a certain die.
Let us repeat this combined experiment a large number of times, the conditions of
each single throw being kept as uniform as possible. We may then study the
behaviour of the conditional frequency of any given result of the second throw, relative
to any hypothesis made with respect to the result of the first throw. Long
experience has failed to detect any kind of influence of such hypotheses on the behaviour
of the conditional frequency, and it seems reasonable to assume that the random
variables connected with the two throws are independent. The same situation arises
when we consider a combined experiment consisting of $n$ throws, where $n$ may have
any value, and accordingly we assume that a sequence of throws made under uniform
conditions form a sequence of independent repetitions, in the sense stated above.

Suppose now that, in each throw, all the six possible results have the
probability $\frac{1}{6}$. Then by (14.4.6) each of the $6^n$ possible results of $n$ consecutive
throws will have the probability $(\frac{1}{6})^n$.
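The count of results and the common probability can be confirmed by direct enumeration. The Python sketch below assumes a die with six equally probable faces and applies the multiplicative relation (14.4.6) throw by throw; the names `single_throw` and `sequence_probabilities` are illustrative.

```python
from fractions import Fraction
from itertools import product

# Assumed single-throw distribution: six faces, each with probability 1/6.
single_throw = {face: Fraction(1, 6) for face in range(1, 7)}

def sequence_probabilities(n):
    """Probability of each ordered result of n independent throws, by (14.4.6)."""
    result = {}
    for outcome in product(single_throw, repeat=n):
        p = Fraction(1)
        for face in outcome:
            p *= single_throw[face]   # one factor per throw
        result[outcome] = p
    return result

three_throws = sequence_probabilities(3)
print(len(three_throws), set(three_throws.values()))
```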



Finally, let us consider $n$ independent variables $\xi_1, \ldots, \xi_n$. If, in
the multiplicative relation (14.4.6), we allow a certain number of the
sets $S_r$ to coincide with the whole spaces of the corresponding variables,
it follows that any group of $n_1 < n$ of the variables are independent.

The converse of the last proposition is not true. We shall, in fact, give an
example due to S. Bernstein of three one-dimensional variables $\xi$, $\eta$, $\zeta$ such that any
two of the variables are independent, while the three variables $\xi$, $\eta$, $\zeta$ are not independent.
Let the three-dimensional distribution of the combined variable $(\xi, \eta, \zeta)$ be such that
each of the four points

(1, 0, 0),   (0, 1, 0),   (0, 0, 1),   (1, 1, 1)

carries the mass $\frac{1}{4}$. It is then easily verified that any one-dimensional marginal
distribution has a mass equal to $\frac{1}{2}$ in each of the two points 0 and 1, while any
two-dimensional marginal distribution has a mass equal to $\frac{1}{4}$ in each of the four points
(0,0), (1,0), (0,1) and (1,1). It follows that any two of the variables are independent.
We have e. g.

$P(\xi = 1,\ \eta = 1) = P(\xi = 1)\,P(\eta = 1) = (\frac{1}{2})^2 = \frac{1}{4}$,

and it is seen without difficulty that the analogous relation holds for any events
$\xi \subset S$ and $\eta \subset T$, so that (14.4.1) is satisfied. But the three variables $\xi$, $\eta$, $\zeta$ are not
independent, as we have

$P(\xi = 1,\ \eta = 1,\ \zeta = 1) = \frac{1}{4}$

but

$P(\xi = 1)\,P(\eta = 1)\,P(\zeta = 1) = (\frac{1}{2})^3 = \frac{1}{8}$.
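Bernstein's example is small enough to verify mechanically. The following Python sketch stores the four mass points, forms the marginal distributions by summation, and tests the multiplicative relation on point events (which suffices here, since every event is a finite union of points). The function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations, product

quarter = Fraction(1, 4)
# Bernstein's distribution: mass 1/4 at each of four points of (xi, eta, zeta).
mass = {(1, 0, 0): quarter, (0, 1, 0): quarter,
        (0, 0, 1): quarter, (1, 1, 1): quarter}

def marginal(coords):
    """Marginal distribution of the listed coordinates, as {values: mass}."""
    out = {}
    for point, p in mass.items():
        key = tuple(point[i] for i in coords)
        out[key] = out.get(key, 0) + p
    return out

def pairwise_independent():
    """Check the multiplicative relation for every pair of variables."""
    for i, j in combinations(range(3), 2):
        pij, pi, pj = marginal((i, j)), marginal((i,)), marginal((j,))
        for a, b in product((0, 1), repeat=2):
            if pij.get((a, b), 0) != pi[(a,)] * pj[(b,)]:
                return False
    return True

def mutually_independent():
    """Check the multiplicative relation for all three variables together."""
    s = [marginal((i,)) for i in range(3)]
    return all(
        mass.get(pt, 0) == s[0][(pt[0],)] * s[1][(pt[1],)] * s[2][(pt[2],)]
        for pt in product((0, 1), repeat=3)
    )

print(pairwise_independent(), mutually_independent())
```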


14.5. Functions of random variables. — Consider first the case of 
a one-dimensional random variable $\xi$ with the pr. f. $P$. Suppose that,
at each performance of the random experiment to which $\xi$ is attached,
we do not observe directly the variable $\xi$ itself, but a certain
real-valued function $g(\xi)$, which is finite and uniquely defined for all real $\xi$.
As usual we assume that $g(\xi)$ is $B$-measurable (cf 6.2).

The equation $\eta = g(\xi)$ defines a correspondence between the variables
$\xi$ and $\eta$. Denote by $Y$ a given set on the $\eta$-axis, and by $X$ the
corresponding set of all $\xi$ such that $\eta = g(\xi) \subset Y$. It has been shown
in 5.2 that the set $X$ corresponding to any Borel set $Y$ is a Borel
set. When $X$ and $Y$ are corresponding sets, we have $\eta \subset Y$ when
and only when $\xi \subset X$, so that the two events $\eta \subset Y$ and $\xi \subset X$ are
completely equivalent. The latter event has, by Axiom 1, a definite
probability $P(X)$, and thus the event $\eta \subset Y$ has the same probability.



We thus see that any function $\eta = g(\xi)$ of the random variable $\xi$ is
itself a random variable, with a probability distribution determined by
the distribution of $\xi$. In fact, if $Q$ denotes the pr. f. of $\eta$, it follows
from the above that we have for any Borel set $Y$

(14.5.1)    $Q(Y) = P(X)$,

where $X$ is the set corresponding to $Y$. If, in particular, we choose
for the set $Y$ the closed interval $(-\infty, y)$, and denote by $S_y$ the set
of all $\xi$ such that $\eta = g(\xi) \le y$, it follows that the d. f. of the variable $\eta$ is

(14.5.2)    $G(y) = Q(\eta \le y) = P(S_y)$.

Let the $\xi$-distribution be interpreted in the usual way as a distribution of mass
on the $\xi$-axis. Let us imagine that every mass particle in this distribution is moved
from its original place on the $\xi$-axis, first in a vertical direction until it reaches the
curve $\eta = g(\xi)$, and then horizontally towards the $\eta$-axis. The distribution on the
$\eta$-axis generated in this way will be the distribution defined by (14.5.1).
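The mass-transport picture translates directly into a computation when the distribution has finitely many mass points: each mass is carried from its point x to the point g(x), and masses arriving at the same point are added. The Python sketch below assumes such a discrete distribution; `pushforward`, `df`, `xi` and `eta` are illustrative names.

```python
from fractions import Fraction

def pushforward(dist, g):
    """Distribution of eta = g(xi): each mass moves from x to g(x), cf (14.5.1)."""
    q = {}
    for x, p in dist.items():
        q[g(x)] = q.get(g(x), 0) + p
    return q

def df(dist, y):
    """Distribution function: total mass at points not exceeding y."""
    return sum(p for v, p in dist.items() if v <= y)

# Assumed example: xi takes the values -2, ..., 2 with equal probabilities.
xi = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}
eta = pushforward(xi, lambda x: x * x)

# G(1) computed from the eta-distribution, and P(S_y) computed in the xi-space.
G_at_1 = df(eta, 1)
P_S_1 = sum(p for x, p in xi.items() if x * x <= 1)
print(eta, G_at_1, P_S_1)
```

The two last quantities agree, which is the content of (14.5.2) for y = 1.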

The above considerations are immediately extended to any number
of dimensions. Let $\xi = (\xi_1, \ldots, \xi_j)$ be a random variable in a
$j$-dimensional space $R_j$, with the pr. f. $P$. Consider a $k$-dimensional vector
function $\eta = g(\xi) = (\eta_1, \ldots, \eta_k)$, which is finite and uniquely defined
for all $\xi$ in $R_j$, and is itself represented by a point in a $k$-dimensional
space $R_k$. We assume that any component $\eta_\nu$ of $\eta$ is a $B$-measurable
function (cf 9.1) of the variables $\xi_1, \ldots, \xi_j$. It then follows as in the
one-dimensional case that $\eta$ is a random variable in $R_k$, with a pr. f.
$Q$ determined by the relation (14.5.1) where, now, $Y$ denotes any given
set in $R_k$, while $X$ is the corresponding set of all $\xi$ in $R_j$ such that
$g(\xi) \subset Y$.

For a set $Y$ such that the corresponding set $X$ is empty, we
obtain, of course, $Q(Y) = 0$. The condition that $g(\xi)$ should be
finite and uniquely defined for all $\xi$ in $R_j$ may obviously be replaced
by the more general condition that the points $\xi$ where $g(\xi)$ is not
finite or not uniquely defined, should form a set $S$ such that $P(S) = 0$.

As an example, we may take $g(\xi) = (\xi_1, \ldots, \xi_r)$, where $r < j$, so
that $g(\xi)$ is simply the projection of the point $\xi$ on a certain subspace
(cf 3.5) of $r$ dimensions. The pr. f. of $g(\xi)$ is then $Q(Y) = P(X)$,
where $Y$ is a set in the subspace, while $X$ is the cylinder set (cf 3.5)
in $R_j$ defined by the relation $(\xi_1, \ldots, \xi_r, 0, \ldots, 0) \subset Y$. The
corresponding distribution is the marginal distribution (cf 8.4) of $(\xi_1, \ldots, \xi_r)$,
which is obtained by projecting the original distribution on the
$r$-dimensional subspace. Taking, in particular, $r = 1$, it is seen that
every component $\xi_\nu$ of the random variable $\xi$ is itself a random variable,
with a marginal distribution obtained by projecting the original
distribution on the axis of $\xi_\nu$.

A function $g(\xi_1, \ldots, \xi_n)$ of $n$ random variables may be regarded
as a function of the combined variable $(\xi_1, \ldots, \xi_n)$. Thus according
to the above $g(\xi_1, \ldots, \xi_n)$ is always a random variable, with a probability
distribution uniquely determined by the simultaneous distribution of
$\xi_1, \ldots, \xi_n$.

If $\xi_1, \ldots, \xi_n$ are independent variables, it is immediately seen that
the variables $g_1(\xi_1), \ldots, g_n(\xi_n)$ are also independent.

14.6. Conclusion. The contents of the present chapter may be
briefly summed up in the following way. From the domain of
empirical data connected with random experiments, we have selected
the fundamental fact of statistical regularity, viz. the long run
stability of frequency ratios. In our mathematical theory, we have
idealized this fact by postulating the existence of conceptual
counterparts of the frequency ratios: the mathematical probabilities. The
process of idealization has then been carried one step further by our
assumption that the additive property of the probabilities may be
extended from a finite to an enumerable sequence of events. In
this way, we have reached the concept of a random variable and its
probability distribution.

We have further introduced the assumption that any number of
random experiments may be joined to form a combined random
experiment, showing the same kind of statistical regularity as the
component experiments. Thus we have obtained the idea of the joint
probability distribution of a number of random variables.

The study of certain conditional frequencies has led us to
introduce their conceptual counterparts, under the name of conditional
probabilities. These are connected with a certain conditional
distribution of a random variable, which in a particular case gives rise to the
important concept of independent random variables.

Finally, it has been shown that a $B$-measurable function of any
number of random variables is itself a random variable, with a
probability distribution uniquely determined by the joint distribution of
the arguments.

We have thus laid the foundations for a purely mathematical 
theory of random variables and probability distributions. Our next 
object will now be to work out this theory in detail, and the rest of
Part II will be devoted to this purpose. In Chs 15-20 we shall
mainly be concerned with variables and distributions in one
dimension, while the multi-dimensional case will be dealt with in Chs
21-24.

In Part III, we shall then turn to questions of testing the
mathematical theory by experience, and using the results of the theory for
purposes of statistical inference.





Chapters 15-20. Variables and Distributions in $R_1$.


CHAPTER 15. 

General Properties. 

15.1. Distribution function and frequency function. Consider a
one-dimensional random variable $\xi$. By Axioms 1 and 2 of 14.1, $\xi$
possesses a definite probability distribution in $R_1$. This distribution
may be concretely interpreted as the distribution of a unit of mass
over $R_1$, in such a way that the mass quantity $P(S)$ allotted to any
Borel set $S$ represents the probability that the variable $\xi$ takes a
value belonging to $S$.

As we have seen in 6.6, we are at liberty to define the
distribution either by the non-negative and additive set function $P(S)$, which
is called the probability function (abbreviated pr.f.) of the variable $\xi$,
or by the corresponding point function $F(x)$ defined by the relation

$P(\xi \le x) = F(x)$,

which is called the distribution function (abbreviated d.f.) of $\xi$. In the
present case of a one-dimensional distribution, we shall practically
always use $F(x)$.

The reader is referred to the discussion of the general properties
of a d.f. given in 6.6. In particular it has been shown there that
any d.f. is a non-decreasing function of $x$, which is everywhere
continuous to the right, and is such that $F(-\infty) = 0$ and $F(+\infty) = 1$.
The difference $F(b) - F(a)$ represents the probability that the variable
$\xi$ takes a value belonging to the interval $a < \xi \le b$:

$P(a < \xi \le b) = F(b) - F(a)$.

If $x_0$ is a discontinuity point of $F(x)$, with a saltus equal to $p_0$,
it follows from 6.6 that the mass $p_0$ is concentrated in the point $x_0$,
which means that we have the probability $p_0$ that the variable $\xi$ takes
the value $x_0$:

$P(\xi = x_0) = p_0$.


If, on the other hand, the derivative $F'(x) = f(x)$ exists in a
certain point $x$, then $f(x)$ represents the density of mass at this point
and we shall call $f(x)$ the probability density or the frequency function
(abbreviated fr.f.) of the variable. The probability that the variable
$\xi$ takes a value belonging to the interval $x < \xi < x + \Delta x$ is then for
small $\Delta x$ asymptotically equal to $f(x)\,\Delta x$, which is written in the
usual differential notation

$P(x < \xi < x + dx) = f(x)\,dx$.

This differential will be called the probability element of the
distribution.

Any function $\eta = g(\xi)$ of the random variable $\xi$ is, by 14.5, itself
a random variable, with a d. f. given by (14.5.2). We shall consider
two simple examples, that will often occur in the sequel.

In the case of a linear function $\eta = a\xi + b$, the relation $\eta \le y$ is
equivalent to $\xi \le (y - b)/a$ or to $\xi \ge (y - b)/a$, according as $a > 0$ or
$a < 0$. It then follows from (14.5.2) that $\eta$ has the d. f.

(15.1.1)    $G(y) = \begin{cases} F\left(\frac{y-b}{a}\right) & \text{if } a > 0, \\ 1 - F\left(\frac{y-b}{a}\right) & \text{if } a < 0, \end{cases}$

where $F(x)$ denotes the d.f. of $\xi$. The formula for $G(y)$ in the case
$a < 0$ is, however, only valid if $(y - b)/a$ is a continuity point of $F$.
In a discontinuity point, the function should, according to our usual
convention, be so determined as to be always continuous to the right.
If the fr.f. $f(x) = F'(x)$ exists for all values of $x$, it follows that $\eta$
has the fr.f.

(15.1.2)    $g(y) = G'(y) = \frac{1}{|a|}\, f\left(\frac{y-b}{a}\right)$.


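Next, we consider the function $\eta = \xi^2$. The variable $\eta$ is here always
non-negative, and for $y > 0$ the relation $\eta \le y$ is equivalent to
$-\sqrt{y} \le \xi \le \sqrt{y}$. Consequently $\eta$ has the d.f.

(15.1.3)    $G(y) = \begin{cases} 0 & \text{for } y < 0, \\ F(\sqrt{y}) - F(-\sqrt{y}) & \text{for } y \ge 0. \end{cases}$

This time, the last expression is valid only if $-\sqrt{y}$ is a continuity
point of $F$. If the fr.f. $f(x) = F'(x)$ exists for all $x$, it follows that
$\eta$ has the fr. f.

(15.1.4)    $g(y) = \begin{cases} 0 & \text{for } y < 0, \\ \frac{1}{2\sqrt{y}} \left( f(\sqrt{y}) + f(-\sqrt{y}) \right) & \text{for } y > 0. \end{cases}$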

Other simple functions may be treated in a similar way. 
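The formulae for the linear function and for the square can be checked numerically by comparing each frequency function with a difference quotient of the corresponding d.f. The Python sketch below assumes, purely for illustration, that the original variable has the standard normal d.f. (expressed through `math.erf`), so that every point is a continuity point; all function names are illustrative.

```python
import math

def F(x):
    """Assumed d.f. of xi: standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def f(x):
    """Corresponding fr.f."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def linear_df(y, a, b):
    """D.f. of eta = a*xi + b; the case a < 0 uses continuity of F."""
    u = (y - b) / a
    return F(u) if a > 0 else 1.0 - F(u)

def linear_frf(y, a, b):
    """Fr.f. of eta = a*xi + b."""
    return f((y - b) / a) / abs(a)

def square_df(y):
    """D.f. of eta = xi**2."""
    if y < 0:
        return 0.0
    r = math.sqrt(y)
    return F(r) - F(-r)

def square_frf(y):
    """Fr.f. of eta = xi**2 for y > 0, and 0 otherwise."""
    if y <= 0:
        return 0.0
    r = math.sqrt(y)
    return (f(r) + f(-r)) / (2.0 * r)

def difference_quotient(G, y, h=1e-5):
    """Central difference approximation to G'(y)."""
    return (G(y + h) - G(y - h)) / (2.0 * h)

print(difference_quotient(square_df, 1.5), square_frf(1.5))
print(difference_quotient(lambda y: linear_df(y, -2.0, 3.0), 1.0),
      linear_frf(1.0, -2.0, 3.0))
```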

15.2. Two simple types of distributions. — In the majority of 
problems occurring in statistical applications, we are concerned with 
distributions belonging to one of the two simple types known as the 
discrete and the continuous type. 

1. The discrete type. A random variable $\xi$ will be said to be of
the discrete type, or to possess a distribution of this type, if the
total mass of the distribution is concentrated in discrete mass points 1)
and if, moreover, any finite interval contains at most a finite number
of the mass points. By 6.2, the set of all mass points is finite or
enumerable. Let us denote the mass points by $x_1, x_2, \ldots$ and the
corresponding masses by $p_1, p_2, \ldots$. The distribution of $\xi$ is then
completely described by saying that, for every $\nu$, we have the
probability $p_\nu$ that $\xi$ takes the value $x_\nu$:

$P(\xi = x_\nu) = p_\nu$.

For a set $S$ not containing any point $x_\nu$ we have, on the other hand,

$P(\xi \subset S) = 0$.

Since the total mass in the distribution must be unity, we always have

$\sum_\nu p_\nu = 1$.

The d. f. $F(x)$ is then given by

(15.2.1)    $F(x) = P(\xi \le x) = \sum_{x_\nu \le x} p_\nu$,

the summation being extended to all values of $\nu$ such that $x_\nu \le x$.
Thus $F(x)$ is a step-function (cf 6.2 and 6.6), which is constant over
every interval not containing any point $x_\nu$, but has in each $x_\nu$ a step
of the height $p_\nu$.

1) This corresponds to the case $c_1 = 1$, $c_2 = 0$ in (6.6.2).

A distribution of the discrete type may be graphically represented
by means of a diagram of the function $F(x)$, or by a diagram showing
an ordinate of the height $p_\nu$ over each point $x_\nu$, as illustrated by
Figs 4 and 5.

Fig. 4. Distribution function of the discrete type. (Note that the median is
indeterminate; cf p. 178.)

Fig. 5. Probabilities corresponding to the distribution in Fig. 4.
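The step-function character of the d.f. of a discrete distribution is easy to exhibit in code. The Python sketch below builds F from given mass points by the summation rule (15.2.1); the particular points and masses are an assumed example, and `discrete_df` is an illustrative name.

```python
from fractions import Fraction

def discrete_df(points, masses):
    """Return F with F(x) = sum of the masses p_v at points x_v <= x."""
    if sum(masses) != 1:
        raise ValueError("the total mass must be unity")
    def F(x):
        return sum(p for xv, p in zip(points, masses) if xv <= x)
    return F

# Assumed example: four mass points with masses 1/8, 3/8, 3/8, 1/8.
points = [0, 1, 2, 3]
masses = [Fraction(1, 8), Fraction(3, 8), Fraction(3, 8), Fraction(1, 8)]
F = discrete_df(points, masses)

# F is constant between mass points and jumps by p_v at each x_v.
step_at_1 = F(1) - F(Fraction(999, 1000))
print(F(-1), F(0), F(Fraction(1, 2)), F(1), F(3), step_at_1)
```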

In statistical applications, variables of the discrete type occur e. g. in cases where
the variable represents a certain number of units of some kind. Examples are: the
number of pigs in a litter, the number of telephone calls at a given station during
one hour, the number of business failures during one year. In such cases, the mass
points $x_\nu$ are simply the natural numbers 0, 1, 2, ....

2. The continuous type. A variable $\xi$ will be said to be of the
continuous type, or to possess a distribution of this type, if the d. f.
$F(x)$ is everywhere continuous 1) and if, moreover, the fr. f. $f(x) = F'(x)$
exists and is continuous for all values of $x$, except possibly in certain
points, of which any finite interval contains at most a finite number.
The d. f. $F(x)$ is then

$F(x) = P(\xi \le x) = \int_{-\infty}^{x} f(t)\,dt$.

1) This corresponds to the case $c_1 = 0$, $c_2 = 1$ in (6.6.2).


The distribution has no discrete mass points, and consequently the
probability that $\xi$ takes a particular value $x_0$ is zero for every $x_0$:

$P(\xi = x_0) = 0$.

The probability that $\xi$ takes a value belonging to the finite or infinite
interval $(a, b)$ has thus the same value, whether we consider the
interval as closed, open or half open, and is given by

$P(a < \xi < b) = F(b) - F(a) = \int_a^b f(t)\,dt$.

Since the total mass in the distribution must be unity, we always have

$\int_{-\infty}^{\infty} f(t)\,dt = 1$.

A distribution of the continuous type may be graphically
represented by diagrams showing the d. f. $F(x)$ or the fr. f. $f(x)$, as
illustrated by Figs 6-7. The curve $y = f(x)$ is known as the frequency
curve of the distribution.
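The relation between the d.f. and the fr.f. of a continuous distribution can be illustrated by numerical integration. The Python sketch below assumes, as an example only, the frequency function exp(-t) for t > 0, whose d.f. is known in closed form, and uses a simple midpoint rule; `integrate` and the chosen interval are illustrative.

```python
import math

def f(t):
    """Assumed frequency function: exp(-t) for t > 0, and 0 otherwise."""
    return math.exp(-t) if t > 0 else 0.0

def F(x):
    """Corresponding distribution function."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def integrate(func, a, b, n=100_000):
    """Midpoint-rule approximation of the Riemann integral of func over (a, b)."""
    h = (b - a) / n
    return h * sum(func(a + (k + 0.5) * h) for k in range(n))

a, b = 0.5, 2.0
interval_probability = integrate(f, a, b)   # approximates F(b) - F(a)
total_mass = integrate(f, 0.0, 50.0)        # approximates 1; the tail beyond 50 is negligible
print(interval_probability, F(b) - F(a), total_mass)
```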

In statistical applications, variables of the continuous type occur when we are
concerned with the measurement of quantities which, within certain limits, may
assume any value. Examples are: the price of a commodity, the stature of a man, the
yield of a corn field. In such cases variables are treated as continuous, although
strictly speaking the actual data are practically always discontinuous, since every
measurement is expressed by an integral multiple of the smallest unit registered in
our observations. Thus prices are expressed in money units, lengths may be expressed
in cm and weights in kg, etc. When, for theoretical purposes, variables of this kind
are considered as continuous, a certain mathematical idealization of actually observed
facts is thus already implied.


15.3. Mean values. Consider a random variable $\xi$ with the d. f.
$F(x)$, and let $g(\xi)$ be a function integrable over $(-\infty, \infty)$ with
respect to $F$ (cf 7.2). The integral

$\int_{-\infty}^{\infty} g(x)\,dF(x)$

has, in 7.4, been interpreted as a weighted mean of the values of $g(x)$
for all values of $x$, the weights being furnished by the mass quantities
$dF$ situated in the neighbourhood of each point $x$.

Accordingly we shall denote this integral as the mean value or
mathematical expectation of the random variable $g(\xi)$, and write

(15.3.1)    $E(g(\xi)) = \int_{-\infty}^{\infty} g(x)\,dF(x)$.

Fig. 6. Distribution function of the continuous type. (Note that the distribution
has a unique median at $x_0$; cf p. 178.)

Fig. 7. Frequency function of the distribution in Fig. 6. The shaded area corresponds
to the probability $P(a < \xi \le b)$. The distribution has a unique mode (cf p. 179) at $c$.
The skewness (cf p. 184) is positive.

More generally, if $\xi$ is a $j$-dimensional random variable with the
probability function $P(S)$, and if $g(\xi)$ is a one-dimensional function
(particular case $k = 1$ of 14.5) of $\xi$ which is integrable over $R_j$ with
respect to $P(S)$, we define the mean value of $g(\xi)$ by the relation

(15.3.2)    $E(g(\xi)) = \int_{R_j} g(x)\,dP(S)$.


For a complex-valued function $g(\xi) = a(\xi) + i\,b(\xi)$, we use the same
formula to define the mean value, and thus obtain

$E(g(\xi)) = E(a(\xi)) + i\,E(b(\xi))$.

When there is no risk of a misunderstanding, we shall write simply
$E g(\xi)$ or $E(g)$ instead of $E(g(\xi))$.

In the case of a one-dimensional distribution of the discrete type,
as defined in the preceding paragraph, the mean value reduces
according to (7.1.8) to a finite or infinite sum:

$E(g(\xi)) = \sum_\nu p_\nu\, g(x_\nu)$,

while for the continuous type, assuming $g(x)$ to be continuous except
at most in a finite number of points, we obtain by (7.5.5) an ordinary
Riemann integral:

$E(g(\xi)) = \int_{-\infty}^{\infty} g(x) f(x)\,dx$.

The condition that $g$ should be integrable over $(-\infty, \infty)$ with respect
to $F$ is, in the last two particular cases, equivalent to the absolute
convergence of the series or integral representing the mean value. Thus
it is only subject to this condition that the mean value exists. The
condition is always satisfied in the particular case of a bounded
function $g(\xi)$, as pointed out in 7.4.
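For the discrete type the mean value is a plain weighted sum, and the role of absolute convergence can be seen on a simple example. The Python sketch below computes two mean values for an assumed fair die, and then the partial sums of the series for a hypothetical variable taking the value 2^v with probability 2^(-v), v = 1, 2, ...; these partial sums grow without bound, so that the mean value does not exist. The names are illustrative.

```python
from fractions import Fraction

def mean_value(points, masses, g=lambda x: x):
    """E(g(xi)) as the sum of p_v * g(x_v), for finitely many mass points."""
    return sum(p * g(x) for x, p in zip(points, masses))

# Assumed example: a die with six equally probable faces.
die_points = list(range(1, 7))
die_masses = [Fraction(1, 6)] * 6
mean_of_die = mean_value(die_points, die_masses)
mean_of_square = mean_value(die_points, die_masses, g=lambda x: x * x)

def partial_sum(n):
    """First n terms of the series for E(xi) when P(xi = 2**v) = 2**(-v)."""
    return sum(Fraction(1, 2 ** v) * 2 ** v for v in range(1, n + 1))

print(mean_of_die, mean_of_square, partial_sum(10), partial_sum(100))
```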

Consider now two variables $\xi$ and $\eta$, defined in the spaces $R'$ and
$R''$ of any number of dimensions, with the pr. f:s $P_1$ and $P_2$
respectively. Let $g(\xi)$ and $h(\eta)$ be two real or complex functions such that
the mean values $E g(\xi)$ and $E h(\eta)$ both exist. We shall consider the
sum $g(\xi) + h(\eta)$. By 14.5, this sum is a random variable, which may
be regarded as a function of the combined variable $(\xi, \eta)$. If $R$
denotes the space of the combined variable, while $P$ is the corresponding
pr. f., the mean value of the sum has the expression

$E(g(\xi) + h(\eta)) = \int_R (g(x) + h(y))\,dP = \int_R g(x)\,dP + \int_R h(y)\,dP$.

By (9.2.2) the last two integrals reduce, however, to

$\int_{R'} g(x)\,dP_1 = E g(\xi)$   and   $\int_{R''} h(y)\,dP_2 = E h(\eta)$

respectively, so that we obtain

(15.3.3)    $E(g(\xi) + h(\eta)) = E g(\xi) + E h(\eta)$.

The extension of this relation to an arbitrary finite number of terms
is immediate, and we thus have the following important theorem:
The mean value of a sum of random variables is equal to the sum of
the mean values of the terms, provided that the latter mean values exist.

It should be observed that this theorem has been proved without
any assumption concerning the nature of the dependence between the
terms of the sum. In the case of the mean value of a product, it is
not possible to obtain an equally general result. Using the same
notations as above, we have

$E(g(\xi)\,h(\eta)) = \int_R g(x)\,h(y)\,dP$.

In order to reduce this integral to a simple form, we now suppose
that $\xi$ and $\eta$ are independent, so that the pr. f. $P$ satisfies the
multiplicative relation (14.4.4). By the final remark of 14.5, the
variables $g(\xi)$ and $h(\eta)$ are then also independent. On this hypothesis,
the formula for the mean value reduces according to (9.3.1) to

(15.3.4)    $E(g(\xi)\,h(\eta)) = \int_{R'} g(x)\,dP_1 \cdot \int_{R''} h(y)\,dP_2 = E g(\xi) \cdot E h(\eta)$.

The extension to an arbitrary finite number of factors is immediate,
so that we have the following theorem: The mean value of a product
of independent random variables is equal to the product of the mean
values of the factors, provided that the latter mean values exist.
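The different status of the two theorems can be seen on two joint distributions with the same marginal distributions. In the Python sketch below, which uses assumed two-point variables, the addition theorem holds in both cases, while the multiplication theorem holds only in the independent case. The names `expect`, `dependent`, `independent` and `check` are illustrative.

```python
from fractions import Fraction

def expect(joint, g):
    """E(g(xi, eta)) for a joint distribution {(x, y): mass}."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

half = Fraction(1, 2)
# Both distributions give xi and eta the masses 1/2, 1/2 on the points 0 and 1.
dependent = {(0, 0): half, (1, 1): half}                      # eta always equals xi
independent = {(x, y): half * half for x in (0, 1) for y in (0, 1)}

def check(joint):
    """Return (addition theorem holds, multiplication theorem holds)."""
    e_x = expect(joint, lambda x, y: x)
    e_y = expect(joint, lambda x, y: y)
    add_ok = expect(joint, lambda x, y: x + y) == e_x + e_y
    mul_ok = expect(joint, lambda x, y: x * y) == e_x * e_y
    return add_ok, mul_ok

print(check(dependent), check(independent))
```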

We finally consider some simple particular cases of the preceding
general relations. If $\xi$ is a one-dimensional random variable, such
that the mean value $E(\xi)$ obtained by taking $g(\xi) = \xi$ in (15.3.1) exists,
we have for any constant $a$ and $b$

(15.3.5)    $E(a\xi + b) = a\,E(\xi) + b$.

Putting $E(\xi) = m$ we have, in particular,

(15.3.6)    $E(\xi - m) = m - m = 0$.

Taking $g(\xi) = \xi$, $h(\eta) = \eta$ in the addition theorem (15.3.3), we obtain

(15.3.7)    $E(\xi + \eta) = E(\xi) + E(\eta)$.

If $\xi$ and $\eta$ are independent, the multiplication theorem (15.3.4) gives

(15.3.8)    $E(\xi\eta) = E(\xi)\,E(\eta)$.

15.4. Moments. — The moments of a one-dimensional distribution 
have been introduced in 7.4. If, for a positive integer $\nu$, the function
$x^\nu$ is integrable over $(-\infty, \infty)$ with respect to $F(x)$, the mean value

(15.4.1)    $\alpha_\nu = E(\xi^\nu) = \int_{-\infty}^{\infty} x^\nu\,dF(x)$

is called the moment of order $\nu$, or simply the $\nu$:th moment, of the
variable or the distribution, and we say that the $\nu$:th moment is finite
or exists. Obviously $\alpha_0$ always exists and is equal to unity.

If $\alpha_\nu$ exists, the function $|x|^\nu$ is also integrable, so that the $\nu$:th
absolute moment

(15.4.2)    $\beta_\nu = E(|\xi|^\nu) = \int_{-\infty}^{\infty} |x|^\nu\,dF(x)$

exists. It follows that, if $\alpha_k$ exists, then $\alpha_\nu$ and $\beta_\nu$ exist for $0 \le \nu \le k$.

For a distribution of the discrete type, the moments are according
to 15.3 expressed by the series

$\alpha_\nu = \sum_i p_i\, x_i^\nu$,

and for a distribution of the continuous type by the Riemann integral

$\alpha_\nu = \int_{-\infty}^{\infty} x^\nu f(x)\,dx$.

It is only in the case when the series or integral representing the
moment is absolutely convergent that the moment is said to exist.

The first moment $\alpha_1$ is equal to the mean value, or briefly the
mean, of the variable, and will often be denoted by the letter $m$:

$\alpha_1 = E(\xi) = m$.

If $c$ denotes any constant, the quantities

$E[(\xi - c)^\nu] = \int_{-\infty}^{\infty} (x - c)^\nu\,dF(x)$

are called the moments about the point $c$. For $c = 0$, we obtain the
ordinary moments. The absolute moments about $c$ are, of course,
defined in an analogous way. The moments about the mean $m$ are often
called the central moments. These are particularly important and
deserve a special notation. We shall write

(15.4.3)    $\mu_\nu = E[(\xi - m)^\nu] = \int_{-\infty}^{\infty} (x - m)^\nu\,dF(x)$.

Developing the factor $(x - m)^\nu$ we find

(15.4.4)    $\mu_0 = 1$,
            $\mu_1 = 0$,
            $\mu_2 = \alpha_2 - m^2$,
            $\mu_3 = \alpha_3 - 3 m \alpha_2 + 2 m^3$,
            $\mu_4 = \alpha_4 - 4 m \alpha_3 + 6 m^2 \alpha_2 - 3 m^4$,
            ...

For the second moment about any point $c$, we have

$E[(\xi - c)^2] = E[(\xi - m + m - c)^2] = \mu_2 + (c - m)^2 \ge \mu_2$,

so that the second moment becomes a minimum when taken about the
mean.
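The relations between central and ordinary moments, and the minimum property of the second moment, are algebraic identities and can be checked exactly on any discrete distribution. The Python sketch below uses an assumed three-point distribution; `moment`, `alpha` and `mu` are illustrative names for the quantities of (15.4.1) and (15.4.3).

```python
from fractions import Fraction

# Assumed example: mass points 0, 1, 4 with masses 1/2, 1/3, 1/6.
points = [0, 1, 4]
masses = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]

def moment(nu, c=0):
    """Moment of order nu about the point c."""
    return sum(p * (x - c) ** nu for x, p in zip(points, masses))

m = moment(1)                                   # the mean
alpha = [moment(nu) for nu in range(5)]         # ordinary moments
mu = [moment(nu, c=m) for nu in range(5)]       # central moments

# Second moment about an arbitrary point c, compared with mu_2 + (c - m)^2.
c = Fraction(5, 2)
second_about_c = moment(2, c=c)
print(m, mu, second_about_c, mu[2] + (c - m) ** 2)
```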

The moments of any function $g(\xi)$ are the mean values of the
successive powers of $g(\xi)$. In the particular case of a linear function
$g(\xi) = a\xi + b$, the moment $\alpha'_\nu$ is given by the expression

$\alpha'_\nu = E[(a\xi + b)^\nu] = a^\nu \alpha_\nu + \binom{\nu}{1} a^{\nu-1} b\, \alpha_{\nu-1} + \cdots + b^\nu$.

In 7.4, we have given a simple sufficient condition for the existence of the
moment of a given order $k$. We remark further that, when the variable $\xi$ is bounded,
i. e. when finite $a$ and $b$ can be found such that $P(a < \xi < b) = 1$, all moments are
finite, and $|\alpha_\nu| \le \beta_\nu \le |a|^\nu + |b|^\nu$.

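We shall now prove an important inequality for the absolute
moments $\beta_\nu$ defined by (15.4.2). The quadratic form in $u$ and $v$

$\int_{-\infty}^{\infty} \left( u\,|x|^{\frac{\nu-1}{2}} + v\,|x|^{\frac{\nu+1}{2}} \right)^2 dF(x) = \beta_{\nu-1} u^2 + 2 \beta_\nu u v + \beta_{\nu+1} v^2$

is evidently non-negative. Thus by 11.10 the determinant of the form
is non-negative, so that we have $\beta_{\nu-1} \beta_{\nu+1} - \beta_\nu^2 \ge 0$, or

(15.4.5)    $\beta_\nu^{2\nu} \le \beta_{\nu-1}^{\nu}\, \beta_{\nu+1}^{\nu}$.

Replacing here $\nu$ successively by $1, 2, \ldots, \nu$, and multiplying all the
inequalities thus formed, we obtain $\beta_\nu^{\nu+1} \le \beta_{\nu+1}^{\nu}$, or finally

(15.4.6)    $\beta_\nu^{1/\nu} \le \beta_{\nu+1}^{1/(\nu+1)}$    $(\nu = 1, 2, \ldots)$.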
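Both the determinant condition and the final inequality (15.4.6) can be verified exactly for a discrete distribution by working with integer powers of the absolute moments. The Python sketch below uses an assumed three-point distribution; `beta`, `determinant_ok` and `root_inequality_ok` are illustrative names.

```python
from fractions import Fraction

# Assumed example: mass points -2, 1/2, 3 with masses 1/5, 1/2, 3/10.
points = [Fraction(-2), Fraction(1, 2), Fraction(3)]
masses = [Fraction(1, 5), Fraction(1, 2), Fraction(3, 10)]

def beta(nu):
    """Absolute moment of order nu."""
    return sum(p * abs(x) ** nu for x, p in zip(points, masses))

def determinant_ok(nu):
    """Non-negative determinant: beta_(nu-1) * beta_(nu+1) - beta_nu^2 >= 0."""
    return beta(nu - 1) * beta(nu + 1) - beta(nu) ** 2 >= 0

def root_inequality_ok(nu):
    """(15.4.6) raised to the power nu*(nu+1), so that only integer powers occur."""
    return beta(nu) ** (nu + 1) <= beta(nu + 1) ** nu

# Floating-point values of beta_nu^(1/nu), shown only for inspection.
roots = [float(beta(nu)) ** (1.0 / nu) for nu in range(1, 9)]
print(roots)
```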


It is often important to know whether a distribution is uniquely
determined by the sequence of its moments. We shall not enter upon a
complete discussion of this difficult problem, but shall content
ourselves with proving the following criterion that is often useful.

Let $\alpha_0 = 1, \alpha_1, \alpha_2, \ldots$ be the moments of a certain d. f. $F(x)$, all of
which are assumed to be finite. Suppose that the series
$\sum_{\nu=0}^{\infty} \frac{\alpha_\nu}{\nu!} r^\nu$ is absolutely
convergent for some $r > 0$. Then $F(x)$ is the only d.f. that has the
moments $\alpha_0, \alpha_1, \alpha_2, \ldots$.

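We shall first show that $\frac{\beta_n}{n!} r^n \to 0$ as $n \to \infty$. If $n$ is restricted
to even values, this follows directly from our hypothesis, and for odd
values of $n$ we have by (15.4.5)

$\frac{\beta_n}{n!} r^n \le \sqrt{\frac{n+1}{n} \cdot \frac{\beta_{n-1}}{(n-1)!} r^{n-1} \cdot \frac{\beta_{n+1}}{(n+1)!} r^{n+1}}$,

which completes the proof of our assertion. For any integer $n > 0$
and for any real $z$ we have the MacLaurin expansion

$e^{iz} = \sum_{\nu=0}^{n-1} \frac{(iz)^\nu}{\nu!} + \vartheta\, \frac{z^n}{n!}$,

where $\vartheta$ denotes a real or complex quantity of modulus not exceeding
unity. Hence we obtain by means of (10.1.2) the following expansion
for the c.f. $\varphi(t)$ of $F(x)$:

$\varphi(t + h) = \int_{-\infty}^{\infty} e^{ihx} e^{itx}\,dF(x)
 = \sum_{\nu=0}^{n-1} \frac{(ih)^\nu}{\nu!} \int_{-\infty}^{\infty} x^\nu e^{itx}\,dF(x) + \vartheta\, \frac{h^n}{n!} \int_{-\infty}^{\infty} |x|^n\,dF(x)$.

For $|h| < r$ the remainder tends to zero, so that for any $t$ the c.f.
$\varphi(t + h)$ can be developed in Taylor's series, convergent for $|h| < r$.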

Taking first $t = 0$, we find that the series (where we have written
$t$ in the place of $h$)

(15.4.7)    $\varphi(t) = \sum_{\nu=0}^{\infty} \frac{\alpha_\nu}{\nu!} (it)^\nu$

represents the function $\varphi(t)$ at least in the interval $-r < t < r$. In
this interval, $\varphi(t)$ is thus uniquely determined by the moments $\alpha_\nu$.
In the points $t = \pm \frac{1}{2} r$, the series obtained by differentiating (15.4.7)
any number of times is convergent, so that all the derivatives
$\varphi^{(\nu)}(\pm \frac{1}{2} r)$ can be calculated from (15.4.7), i. e. from the moments $\alpha_\nu$.
These derivatives appear as coefficients in the Taylor developments of
$\varphi(\pm \frac{1}{2} r + h)$, which converge and represent $\varphi(t)$ for $|h| < r$, so that
the domain where $\varphi(t)$ is known is now extended to the interval
$-\frac{3}{2} r < t < \frac{3}{2} r$. From the last developments, we can now calculate the
derivatives $\varphi^{(\nu)}(t)$ in the points $t = \pm r$, and use these as coefficients
in the Taylor developments of $\varphi(\pm r + h)$, etc. In this way we may
go on as long as we please, and it will be seen that by this procedure
the c.f. $\varphi(t)$ is uniquely defined by the moments $\alpha_\nu$ for all values of $t$. 1)
It then follows from the uniqueness theorem (10.3.1) that the d. f.
$F(x)$ is also uniquely determined by the $\alpha_\nu$, and our theorem is proved.

In the particular case when $F(x)$ is the d. f. of a bounded variable,
it follows from the remark made above that the conditions of the
theorem are always satisfied.

1) This is the method known as analytic continuation in the Theory of Analytic
Functions.


15.5. Measures of location. In practical applications it is
important to be able to describe the main features of a distribution by
means of a few simple parameters. In the first place, we often want
to locate a distribution by finding some typical value of the variable,
which may be conceived as a central point of the distribution. There
are various ways of calculating such a typical parameter, and we shall
here discuss the three most important cases, viz. the mean, the median,
and the mode.


The mean $E(\xi) = m$ is the first moment of the distribution, and
has already been defined in the preceding paragraph. In terms of our
mechanical interpretation of the probability distribution as a
distribution of mass, the mean has an important concrete significance:
it is the abscissa of the centre of gravity of the distribution
(cf. 7.4). This property gives the mean an evident claim to be
regarded as a typical parameter.

The median. If $x_0$ is a point which divides the whole mass of
the distribution into two equal parts, each containing the mass $\frac{1}{2}$, $x_0$
is called a median of the distribution. Thus any root of the equation
$F(x) = \frac{1}{2}$ is a median of the distribution. In order to discuss the
possible cases, we consider the curve $y = F(x)$, regarding any vertical
step as part of the curve, so that we have a single connected, never
decreasing curve (cf. Figs 4 and 6). This curve has at least one point
of intersection with the straight line $y = \frac{1}{2}$. If there is only one point
of intersection, the abscissa of this point is the unique median of the
distribution (cf. Fig. 6). It may, however, occur that the curve and the
line have a whole closed interval in common (cf. Fig. 4). In this case
the abscissa of every point in the interval satisfies the equation
$F(x) = \frac{1}{2}$, and may thus be taken as a median of the distribution.

We thus see that every distribution has at least one median. In the
determinate case, the median is uniquely defined; in the indeterminate
case, every point in a certain closed interval is a median.

The mean, on the other hand, does not always exist. Even in 
cases when the mean does exist, the median is sometimes preferable 
as a typical parameter, since the value of the mean may be largely 
influenced by the occurrence of very small masses situated at a very 
large distance from the bulk of the distribution. 

As shown in the preceding paragraph, the mean is characterized
by a certain minimum property: the second moment becomes a minimum
when taken about the mean. There is an analogous property of
the median: the first absolute moment $E(|\xi - c|)$ becomes a minimum
when $c$ is equal to the median. This property holds even in the
indeterminate case, and the moment has then the same value for $c$ equal
to any of the possible median values. Denoting the median (or, in
the indeterminate case, any median value) by $\mu$, we have in fact the
relations

    $E(|\xi - c|) = E(|\xi - \mu|) + 2 \int_{\mu}^{c} (c - x)\, dF(x)$   for $c > \mu$,
    $E(|\xi - c|) = E(|\xi - \mu|) + 2 \int_{c}^{\mu} (x - c)\, dF(x)$   for $c < \mu$.

The second terms on the right hand sides are evidently positive, except
in the case when $c$ is another median value (indeterminate case),
when the corresponding term is zero.¹) The proof of these relations
will be left as an exercise for the reader.
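
The minimum property of the median can be checked numerically. The following Python sketch is an illustration added here, not part of the original argument; the distribution and the helper name `abs_moment` are assumptions chosen for the example. The distribution has the unique median 2, and a scan over a grid of values of c locates the smallest first absolute moment.

```python
from fractions import Fraction as Fr

# Illustrative discrete distribution (an assumption, not from the text):
# mass points and their probabilities, with the unique median 2.
points = [0, 1, 2, 5, 10]
probs = [Fr(1, 10), Fr(2, 10), Fr(3, 10), Fr(3, 10), Fr(1, 10)]

def abs_moment(c):
    """First absolute moment E(|xi - c|) about the point c."""
    return sum(p * abs(x - c) for x, p in zip(points, probs))

# Scan candidate values c = 0, 0.25, ..., 10 and locate the minimum.
grid = [Fr(k, 4) for k in range(0, 41)]
best = min(grid, key=abs_moment)
print(best, abs_moment(best))
```

Exact fractions are used so that the comparison between neighbouring grid points is not disturbed by rounding.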

The mode of a distribution will only be defined for distributions
of the two simple types introduced in 15.2. For a distribution of the
continuous type, any maximum point $x_0$ of the frequency function $f(x)$
is called a mode of the distribution. A unique mode thus only exists
for frequency curves $y = f(x)$ having a single maximum (cf. Fig. 7);
such unimodal distributions occur, however, often in statistical
applications. When the frequency curve has more than one maximum, the
distribution is called bimodal or multimodal, as the case may be.
For a distribution of the discrete type, we may suppose the mass
points $x_\nu$ arranged in increasing order of magnitude. The point $x_\nu$
is then called a mode of the distribution, if $p_\nu > p_{\nu-1}$ and $p_\nu > p_{\nu+1}$.
The expressions unimodal, bimodal and multimodal distributions are
here defined in a similar way as for continuous distributions.

In the particular case when the distribution is symmetric about a certain point $a$,
we have $F(a + x) + F(a - x) = 1$ as soon as $a \pm x$ are continuity points of $F$. It
is then seen that the mean (if existent) and the median are both equal to $a$. If, in
addition, the distribution is unimodal, the mode is also equal to $a$.


¹) In the particular case when $\mu$ is a discontinuity point of $F$, the ordinary
definition of the integrals in the second members must be somewhat modified, as the
integrals should then in both cases include half the contribution arising from the
discontinuity.


15.6. Measures of dispersion. When we know a typical value
for a random variable, it is often required to calculate some parameter
giving an idea of how widely the values of the variable are spread
on either side of the typical value. A parameter of this kind is called
a measure of spread or dispersion. It is sometimes also called a
measure of concentration. Dispersion and concentration vary, of course,
in inverse sense: the greater the dispersion, the smaller the
concentration, and conversely.

If our typical value is the mean $m$ of the distribution, it seems
natural to consider the second moment about the mean, $\mu_2$, as a
dispersion measure. This is called the variance of the variable, and
represents the moment of inertia of the mass distribution with respect
to a perpendicular axis through the centre of gravity (cf. 7.4). We
have, of course, always $\mu_2 \ge 0$. When $\mu_2 = 0$, it follows from the
definition of $\mu_2$ that the whole mass of the distribution must be
concentrated in the single point $m$ (cf. 16.1).

In order to obtain a quantity of the first dimension in units of
the variable, it is, however, often preferable to use the non-negative
square root of $\mu_2$, which is called the standard deviation (abbreviated
s.d.) of the variable, and is denoted by $D(\xi)$ or sometimes by the
single letter $\sigma$. We then have for any variable such that the second
moment exists

    $D^2(\xi) = \sigma^2 = \mu_2 = E(\xi^2) - E^2(\xi)$.

It then follows from (15.3.5) that we have for any constants $a$ and $b$

    $D(a\xi + b) = |a|\, D(\xi)$.

When $\xi$ is a variable with the mean $m$ and the s.d. $\sigma$, we shall
often have occasion to consider the corresponding standardized variable
$(\xi - m)/\sigma$, which represents the deviation of $\xi$ from its mean $m$,
expressed in units of the s.d. $\sigma$. It follows from the last relation and
from (15.3.5) that the standardized variable has zero mean and unit s.d.:

    $E\left(\frac{\xi - m}{\sigma}\right) = 0$,    $D\left(\frac{\xi - m}{\sigma}\right) = 1$.

If $\xi$ and $\eta$ are independent variables, it further follows from (15.3.8)
that we have

(15.6.1)    $D^2(\xi + \eta) = D^2(\xi) + D^2(\eta)$.

This relation is immediately extended to any finite number of terms. If
$\xi_1, \ldots, \xi_n$ are independent variables, we thus obtain

(15.6.2)    $D^2(\xi_1 + \cdots + \xi_n) = D^2(\xi_1) + \cdots + D^2(\xi_n)$.
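
The addition rule for variances, and the rule for the s.d. of a linear function, can be verified on small examples. The Python sketch below is an added illustration; the two distributions and all function names are assumptions made for the example, and independence is built in by giving each pair of values the product of the two probabilities.

```python
from fractions import Fraction as Fr
from itertools import product

def mean(dist):
    """Mean E(xi) of a discrete distribution {value: probability}."""
    return sum(p * x for x, p in dist.items())

def variance(dist):
    """Variance D^2(xi) = E(xi^2) - E(xi)^2."""
    return sum(p * x * x for x, p in dist.items()) - mean(dist) ** 2

def sum_of_independent(d1, d2):
    """Distribution of xi + eta for independent xi and eta."""
    out = {}
    for (x, p), (y, q) in product(d1.items(), d2.items()):
        out[x + y] = out.get(x + y, 0) + p * q
    return out

# Two illustrative distributions (assumed for the example only).
xi = {0: Fr(1, 2), 1: Fr(1, 3), 4: Fr(1, 6)}
eta = {-1: Fr(1, 4), 2: Fr(3, 4)}

total = sum_of_independent(xi, eta)
print(variance(total) == variance(xi) + variance(eta))

# D(a*xi + b) = |a| D(xi), checked through the variances to stay exact.
a, b = -3, 7
scaled = {a * x + b: p for x, p in xi.items()}
print(variance(scaled) == a * a * variance(xi))
```

Comparing variances rather than standard deviations avoids square roots, so both checks are exact.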

We have seen that the second moment is a minimum when taken about
the mean, and the first absolute moment when taken about the median
(cf. 15.4 and 15.5). If we use the median $\mu$ as our typical value, it
thus seems natural to use the first absolute moment $E(|\xi - \mu|)$
as measure of dispersion. This is called the mean deviation of the
variable. Sometimes the name of mean deviation is used for the first
absolute moment taken about the mean, but this practice is not to be
recommended.

In the same way as we have defined the median by means of the
equation $F(x) = \frac{1}{2}$, we may define a quantity $\zeta_p$ by the equation
$F(\zeta_p) = p$, where $p$ is any given number such that $0 < p < 1$. The
quantity $\zeta_p$ will be called the quantile of order $p$ of the distribution.
Like the median, any quantile $\zeta_p$ may sometimes be indeterminate.
The quantile $\zeta_{1/2}$ is, of course, identical with the median. The
knowledge of $\zeta_p$ for some set of conveniently chosen values of $p$, such as
$p = \frac{1}{4}, \frac{1}{2}, \frac{3}{4}$, or $p = 0.1, 0.2, \ldots, 0.9$, will obviously give a good idea of
the location and dispersion of the distribution. The quantities $\zeta_{1/4}$ and $\zeta_{3/4}$
are called the lower and upper quartiles, while the quantities $\zeta_{0.1}, \zeta_{0.2}, \ldots$
are known as the deciles. The halved difference $\frac{1}{2}(\zeta_{3/4} - \zeta_{1/4})$ is sometimes
used as a measure of dispersion under the name of semi-interquartile
range.
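
For a distribution of the discrete type, the quantile of order p and its possible indeterminacy can be read off from the cumulative sums of the probabilities. The following Python sketch is an added illustration under assumed data; the function name `quantile` and the example distribution are hypothetical. It returns the end points of the closed interval of admissible values, which coincide in the determinate case.

```python
from fractions import Fraction as Fr

def quantile(points, probs, p):
    """Return (lo, hi): every value in [lo, hi] is a quantile of order p.

    points must be sorted increasingly and 0 < p < 1;
    lo == hi in the determinate case.
    """
    cum = 0
    lo = hi = None
    for x, w in zip(points, probs):
        cum += w
        if lo is None and cum >= p:
            lo = x            # first point where F reaches the level p
        if cum > p:
            hi = x            # first point where F exceeds the level p
            break
    return lo, hi

# Illustrative distribution (assumed): F takes the values 0.1, 0.3, 0.6, 0.9, 1.
points = [0, 1, 2, 5, 10]
probs = [Fr(1, 10), Fr(2, 10), Fr(3, 10), Fr(3, 10), Fr(1, 10)]

median = quantile(points, probs, Fr(1, 2))
lower = quantile(points, probs, Fr(1, 4))
upper = quantile(points, probs, Fr(3, 4))
print(median, lower, upper)
print(quantile(points, probs, Fr(3, 10)))   # indeterminate: F = 0.3 on [1, 2)
```

With these data the semi-interquartile range is (5 - 1)/2 = 2, while the quantile of order 0.3 is indeterminate, since the line y = 0.3 has a whole interval in common with the curve y = F(x).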

If the whole mass of the distribution is situated within finite
distance, there is an upper bound $g$ of all points $x$ such that $F(x) = 0$,
and a lower bound $G$ of all $x$ such that $F(x) = 1$. The interval $(g, G)$
then contains the whole mass of the distribution. The length $G - g$
of this interval is called the range of the distribution, and may be
used as a measure of dispersion.

The word range is sometimes also used to denote the interval $(g, G)$
itself. If we know this interval, we have a fairly good idea both of the
location and of the dispersion of the distribution. For a distribution
where the range is not finite, intervals such as $(m - \sigma, m + \sigma)$ or
$(\zeta_{1/4}, \zeta_{3/4})$, although they do not contain the whole mass of the
distribution, may be used in a similar way, as a kind of geometrical
representation of the location and dispersion of the distribution (cf. 21.10).

All measures of location and dispersion, and of other similar
properties, are to a large extent arbitrary. This is quite natural, since
the properties to be described by such parameters are too vaguely
defined to admit of unique measurement by means of a single number.
Each measure has advantages and disadvantages of its own, and a
measure which renders excellent service in one case may be more or
less useless in another.

If, in particular, we choose the variance $\sigma^2$ or the s.d. $\sigma$ as our
measure of dispersion, this means that the dispersion of the mass in
a distribution with the mean $m = 0$ is measured by the mean square

    $E(\xi^2) = \int_{-\infty}^{\infty} x^2\, dF(x)$.

The concentration of the variable $\xi$ about the point $m = 0$ will be
measured by the same quantity: the smaller the mean square, the
greater the concentration, and conversely. Thus the mean square of
a variable quantity is considered as a measure of the deviation of this
quantity from zero. This is a way of expressing the famous principle
of least squares, that we shall meet in various connections in the
sequel. It follows from the above that there is no logical necessity
prompting us to adopt this principle. On the contrary, it is largely
a matter of convention whether we choose to do so or not. The main
reason in favour of the principle lies in the relatively simple nature
of the rules of operation to which it leads. We have, e.g., the simple
addition rule (15.6.2) for the variance, while there is no analogue for
the other dispersion measures discussed above.

15.7. Tchebycheff's theorem. We shall now prove the following
generalization of a theorem due to Tchebycheff:

Let $g(\xi)$ be a non-negative function of the random variable $\xi$. For
every $K > 0$ we then have

(15.7.1)    $P[g(\xi) \ge K] \le \frac{E\, g(\xi)}{K}$,

where $P$ denotes as usual the pr.f. of $\xi$.

If we denote by $S$ the set of all $\xi$ satisfying the inequality $g(\xi) \ge K$,
the truth of the theorem follows directly from the relation

    $E\, g(\xi) = \int_{-\infty}^{\infty} g(x)\, dF \ge K \int_{S} dF = K\, P(S)$.

It is evident that the theorem holds, with the same proof, even when
$\xi$ is replaced by a random variable in any number of dimensions.

Taking in particular $g(\xi) = (\xi - m)^2$, $K = k^2 \sigma^2$, where $m$ and $\sigma$
denote the mean and the s.d. of $\xi$, we obtain for every $k > 0$ the
Bienaymé-Tchebycheff inequality:

(15.7.2)    $P(|\xi - m| \ge k\sigma) \le \frac{1}{k^2}$.

This inequality shows that the quantity of mass in the distribution
situated outside the interval $m - k\sigma < \xi < m + k\sigma$ is at most equal
to $1/k^2$, and thus gives a good idea of the sense in which $\sigma$ may be
used as a measure of dispersion or concentration.

For the particular distribution of mean $m$ and s.d. $\sigma$ which has a mass $\frac{1}{2k^2}$
in each of the points $x = m \pm k\sigma$, and a mass $1 - \frac{1}{k^2}$ in the point $x = m$, we have
$P(|\xi - m| \ge k\sigma) = \frac{1}{k^2}$, and it is thus seen that the upper limit of the probability
given by (15.7.2) cannot generally be improved.
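
Both the inequality and the extremal distribution just described lend themselves to a direct check. The Python sketch below is an added illustration; the choices m = 0, s.d. 1 and k = 2, the fair die used as a second example, and the function names are all assumptions. The event |xi - m| >= k times the s.d. is tested through squares so that the computation stays exact.

```python
from fractions import Fraction as Fr

def mean_var(dist):
    """Mean and variance of a discrete distribution {value: probability}."""
    m = sum(p * x for x, p in dist.items())
    return m, sum(p * (x - m) ** 2 for x, p in dist.items())

def tail_mass(dist, k):
    """P(|xi - m| >= k*sigma), compared via squares to stay exact."""
    m, var = mean_var(dist)
    return sum(p for x, p in dist.items() if (x - m) ** 2 >= k * k * var)

# Extremal distribution with m = 0, sigma = 1, k = 2 (assumed values):
# mass 1/(2k^2) in each of the points -k and +k, the rest in the point 0.
k = 2
extremal = {-k: Fr(1, 2 * k * k), 0: 1 - Fr(1, k * k), k: Fr(1, 2 * k * k)}

# A fair die, for which the bound is far from being attained.
die = {face: Fr(1, 6) for face in range(1, 7)}

print(mean_var(extremal))
print(tail_mass(extremal, k) == Fr(1, k * k))
# For k = j/4 the bound 1/k^2 equals 16/j^2.
print(all(tail_mass(die, Fr(j, 4)) <= Fr(16, j * j) for j in range(1, 13)))
```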

On the other hand, if we restrict ourselves to certain classes of distributions, it
is sometimes possible to improve the inequality (15.7.2). Thus it was already shown
by Gauss in 1821 that for a unimodal distribution (cf. 15.5) of the continuous type
we have for every $k > 0$

(15.7.3)    $P(|\xi - x_0| \ge k\tau) \le \frac{4}{9k^2}$,

where $x_0$ is the mode, and $\tau^2 = \sigma^2 + (x_0 - m)^2$ is the second order moment about
the mode. A simple proof of this relation will be indicated in Ex. 4 on p. 256.
Hence we obtain the following inequality for the deviation from the mean:

(15.7.4)    $P(|\xi - m| \ge k\sigma) \le \frac{4}{9} \cdot \frac{1 + s^2}{(k - |s|)^2}$

for every $k > |s|$, where $s$ denotes the Pearson measure of skewness defined by
(15.8.3). For moderate values of $|s|$ this inequality often gives a lower value to the
limit than (15.7.2). Thus if $|s| < 0.25$, the probability of a deviation exceeding $3\sigma$
is by (15.7.4) smaller than 0.0624, while (15.7.2) gives the less precise limit 0.1111.
For the probability of a deviation exceeding $4\sigma$ the corresponding figures are 0.0336
by (15.7.4), and 0.0625 by (15.7.2).
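
The figures quoted above can be reproduced in a few lines. The Python sketch below is an added illustration; it assumes that the inequality for the deviation from the mean has the form 4(1 + s^2) / (9 (k - |s|)^2), which is what Gauss's inequality gives when |xi - m| >= k sigma is replaced by the weaker condition |xi - x_0| >= (k - |s|) sigma. The function names are hypothetical.

```python
def tchebycheff_bound(k):
    """Upper limit 1/k^2 of the Bienayme-Tchebycheff inequality."""
    return 1 / k**2

def gauss_mean_bound(k, s):
    """Assumed form of the unimodal bound: 4(1 + s^2) / (9 (k - |s|)^2).

    Valid only for k > |s|, where s is Pearson's measure of skewness.
    """
    if k <= abs(s):
        raise ValueError("requires k > |s|")
    return 4 * (1 + s**2) / (9 * (k - abs(s)) ** 2)

# Limits for deviations exceeding 3 and 4 standard deviations, |s| = 0.25.
for k in (3, 4):
    print(k, round(gauss_mean_bound(k, 0.25), 4), round(tchebycheff_bound(k), 4))
```

Since the bound increases with |s|, the values obtained at |s| = 0.25 are upper limits for every smaller |s|.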

15.8. Measures of skewness and excess. In a symmetric distribution,
every moment of odd order about the mean (if existent) is
evidently equal to zero. Any such moment which is not zero may
thus be considered as a measure of the asymmetry or skewness of the
distribution. The simplest of these measures is $\mu_3$, which is of the
third dimension in units of the variable. In order to reduce this to
zero dimension, and so construct an absolute measure, we divide by
$\sigma^3$ and regard the ratio

(15.8.1)    $\gamma_1 = \frac{\mu_3}{\sigma^3}$

as a measure of the skewness. We shall call $\gamma_1$ the coefficient of
skewness.

In statistical applications, we often meet unimodal continuous
distributions of the type shown in Fig. 7, where the frequency curve
forms a "long tail" on one side of the mode, and a "short tail" on
the other side. In the curve shown in Fig. 7, the long tail is on the
positive side, and in $\mu_3$ the cubes of the positive deviations will then
generally outweigh the negative cubes, so that $\gamma_1$ will be positive.
We shall call this a distribution of positive skewness. Similarly we
have negative skewness when $\gamma_1$ is negative; the long tail will then
generally be on the negative side.

Reducing the fourth moment $\mu_4$ to zero dimension in the same
way as above, we define the coefficient of excess

(15.8.2)    $\gamma_2 = \frac{\mu_4}{\sigma^4} - 3$,

which is sometimes used as a measure of the degree of flattening of
a frequency curve near its centre. For the important normal distribution
(cf. 17.2), $\gamma_2$ is equal to zero. Positive values of $\gamma_2$ are supposed
to indicate that the frequency curve is more tall and slim than the
normal curve in the neighbourhood of the mode, and conversely for
negative values. In the former case, it is usual to talk of a positive
excess, as compared with the normal curve, in the latter case of a
negative excess. This usage is, however, open to certain criticism
(cf. 17.6).

In the literature, the quantities $\beta_1 = \gamma_1^2$ and $\beta_2 = \gamma_2 + 3$ are often used instead
of $\gamma_1$ and $\gamma_2$.
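
As a small numerical illustration (added here, with assumed data), the coefficients of skewness and excess, taken as the third central moment divided by the cube of the s.d. and the fourth central moment divided by the fourth power of the s.d. minus 3, can be computed for discrete distributions as follows. The two example distributions and the function names are hypothetical.

```python
from fractions import Fraction as Fr

def central_moment(dist, order):
    """Moment of the given order about the mean, for {value: probability}."""
    m = sum(p * x for x, p in dist.items())
    return sum(p * (x - m) ** order for x, p in dist.items())

def skewness_and_excess(dist):
    """Return (gamma_1, gamma_2) = (mu_3 / sigma^3, mu_4 / sigma^4 - 3)."""
    mu2 = central_moment(dist, 2)
    mu3 = central_moment(dist, 3)
    mu4 = central_moment(dist, 4)
    gamma1 = float(mu3) / float(mu2) ** 1.5
    gamma2 = float(mu4 / mu2**2) - 3
    return gamma1, gamma2

# Assumed examples: one with a long tail on the positive side, one symmetric.
long_right_tail = {0: Fr(1, 2), 1: Fr(1, 4), 2: Fr(1, 8), 6: Fr(1, 8)}
symmetric = {-1: Fr(1, 4), 0: Fr(1, 2), 1: Fr(1, 4)}

print(skewness_and_excess(long_right_tail))
print(skewness_and_excess(symmetric))
```

The first distribution gives a positive coefficient of skewness, in agreement with the rule about the long tail; the symmetric one gives zero skewness.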

Many other measures of skewness and excess have been proposed. Thus K.
Pearson introduced the difference between the mean and the mode, divided by the s.d.:

(15.8.3)    $s = \frac{m - x_0}{\sigma}$

as measure of skewness. For the class of distributions belonging to the Pearson
system (cf. 19.4), it can be shown that

    $s = \frac{\gamma_1 (\gamma_2 + 6)}{2 (5\gamma_2 - 6\gamma_1^2 + 6)}$.

When $\gamma_1$ and $\gamma_2$ are small, this gives approximately

    $s = \frac{1}{2}\gamma_1$   or   $x_0 = m - \frac{1}{2}\gamma_1 \sigma$.

The last relation also holds approximately for distributions given by the Edgeworth
or Charlier expansions (cf. 17.6-17.7). Charlier used the coefficient $S = -\frac{1}{2}\gamma_1$ as
measure of skewness, and $E = \frac{1}{8}\gamma_2$ as measure of excess.

15.9. Characteristic functions. The mean value of the particular
function $e^{it\xi}$ will be written

(15.9.1)    $\varphi(t) = E(e^{it\xi}) = \int_{-\infty}^{\infty} e^{itx}\, dF(x)$.

This is a function of the real variable $t$, and will be called the
characteristic function (abbreviated c.f.) of the variable $\xi$ or of the
corresponding distribution. The reader is referred to the discussion of the
mathematical theory of characteristic functions given in Ch. 10.

It follows in particular from this discussion that there is a
one-to-one correspondence between distributions and characteristic
functions. If two distributions are identical, so are their c.f:s, and
conversely. This property has important consequences. In many problems
where it is required to find the distribution of some given random
variable, it is relatively easy to find the c.f. of the variable. If this
is found to agree with the c.f. of some already known distribution,
we may conclude that the latter must be identical with the required
distribution.

The c.f. of any function $g(\xi)$ is the mean value of $e^{itg(\xi)}$. In the
particular case of a linear function $g(\xi) = a\xi + b$ the c.f. becomes

(15.9.2)    $E(e^{it(a\xi + b)}) = e^{bit} \varphi(at)$.

Thus e.g. the variable $-\xi$ has the c.f. $\varphi(-t) = \overline{\varphi(t)}$. Further, the
standardized variable $(\xi - m)/\sigma$ has the c.f.

    $e^{-mit/\sigma}\, \varphi(t/\sigma)$.
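
The rule for the c.f. of a linear function, namely that a*xi + b has the c.f. exp(i b t) times the c.f. of xi taken at the point a*t, can be verified numerically for a distribution of the discrete type. The Python sketch below is an added illustration; the distribution, the constants and the function name `cf` are assumptions made for the example.

```python
import cmath

def cf(dist, t):
    """Characteristic function E(exp(i t xi)) of a discrete distribution."""
    return sum(p * cmath.exp(1j * t * x) for x, p in dist.items())

xi = {0: 0.5, 1: 0.3, 3: 0.2}        # illustrative distribution (assumed)
a, b, t = 2.0, -1.0, 0.7

# c.f. of the linear function a*xi + b, computed from its own distribution.
lhs = cf({a * x + b: p for x, p in xi.items()}, t)
# The same quantity from the rule: exp(i b t) * phi(a t).
rhs = cmath.exp(1j * b * t) * cf(xi, a * t)
print(abs(lhs - rhs) < 1e-12)

# The variable -xi has the c.f. phi(-t), the complex conjugate of phi(t).
print(abs(cf(xi, -t) - cf(xi, t).conjugate()) < 1e-12)
```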



15.10. Semi-invariants. If the $k$:th moment of the distribution
exists, the c.f. may according to (10.1.3) be developed in MacLaurin's
series for small values of $t$:

(15.10.1)    $\varphi(t) = 1 + \sum_{\nu=1}^{k} \frac{\alpha_\nu}{\nu!} (it)^\nu + o(t^k)$.

For the function $\log(1 + z)$ we have the corresponding development

    $\log(1 + z) = z - \frac{z^2}{2} + \frac{z^3}{3} - \cdots \pm \frac{z^k}{k} + o(z^k)$.

Replacing here $1 + z$ by $\varphi(t)$, we obtain after rearrangement of the
terms a development of the form

(15.10.2)    $\log \varphi(t) = \sum_{\nu=1}^{k} \frac{\kappa_\nu}{\nu!} (it)^\nu + o(t^k)$.

The coefficients $\kappa_\nu$ were introduced by Thiele (Ref. 37), and are called
the semi-invariants or cumulants of the distribution.

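In order to deduce the relations between the moments $\alpha_\nu$ and the
semi-invariants $\kappa_\nu$, we may use the identities

    $\log \varphi(t) = \log\left(1 + \sum_{\nu=1}^{\infty} \frac{\alpha_\nu}{\nu!} (it)^\nu\right) = \sum_{\nu=1}^{\infty} \frac{\kappa_\nu}{\nu!} (it)^\nu$

in a purely formal way, without paying any attention to questions
of existence of moments or convergence of series. It is seen that $\kappa_n$
is a polynomial in $\alpha_1, \ldots, \alpha_n$, and conversely $\alpha_n$ is a polynomial in
$\kappa_1, \ldots, \kappa_n$. In particular we have

             $\kappa_1 = \alpha_1 = m$,
             $\kappa_2 = \alpha_2 - \alpha_1^2 = \sigma^2$,
(15.10.3)    $\kappa_3 = \alpha_3 - 3\alpha_1\alpha_2 + 2\alpha_1^3$,
             $\kappa_4 = \alpha_4 - 3\alpha_2^2 - 4\alpha_1\alpha_3 + 12\alpha_1^2\alpha_2 - 6\alpha_1^4$,

and conversely

             $\alpha_1 = \kappa_1$,
             $\alpha_2 = \kappa_2 + \kappa_1^2$,
(15.10.4)    $\alpha_3 = \kappa_3 + 3\kappa_1\kappa_2 + \kappa_1^3$,
             $\alpha_4 = \kappa_4 + 3\kappa_2^2 + 4\kappa_1\kappa_3 + 6\kappa_1^2\kappa_2 + \kappa_1^4$.

In terms of the central moments $\mu_\nu$ the expressions of the $\kappa_\nu$ become

             $\kappa_1 = m$,
             $\kappa_2 = \mu_2 = \sigma^2$,
             $\kappa_3 = \mu_3$,
(15.10.5)    $\kappa_4 = \mu_4 - 3\mu_2^2$,
             $\kappa_5 = \mu_5 - 10\mu_2\mu_3$,
             $\kappa_6 = \mu_6 - 15\mu_2\mu_4 - 10\mu_3^2 + 30\mu_2^3$,

so that the coefficients of skewness and excess introduced in 15.8 are

    $\gamma_1 = \frac{\kappa_3}{\kappa_2^{3/2}}$   and   $\gamma_2 = \frac{\kappa_4}{\kappa_2^2}$.

The semi-invariants $\kappa'_\nu$ of a linear function $g(\xi) = a\xi + b$ are, by
(15.9.2), found from the development

    $\log[e^{bit} \varphi(at)] = \sum_{\nu=1}^{k} \frac{\kappa'_\nu}{\nu!} (it)^\nu + o(t^k)$.

Comparing with (15.10.2), we obtain the expressions

    $\kappa'_1 = a\kappa_1 + b$,   and   $\kappa'_\nu = a^\nu \kappa_\nu$ for $\nu > 1$.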
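
The formal passage from moments to semi-invariants can also be carried out mechanically. The Python sketch below is an added illustration and does not follow the method of the text: instead of expanding the logarithm, it uses the standard recursion stated in the docstring, which is equivalent to equating coefficients. The example moments (a variable equal to 1 with probability p and to 0 otherwise, so that every moment about the origin equals p) and the function name are assumptions.

```python
from fractions import Fraction as Fr
from math import comb

def cumulants_from_moments(alpha):
    """alpha[n] is the n-th moment about the origin (alpha[0] == 1).

    Returns kappa with kappa[n] the n-th semi-invariant, using the recursion
    kappa_n = alpha_n - sum_{j=1}^{n-1} C(n-1, j-1) kappa_j alpha_{n-j}.
    """
    kappa = [0] * len(alpha)
    for n in range(1, len(alpha)):
        kappa[n] = alpha[n] - sum(
            comb(n - 1, j - 1) * kappa[j] * alpha[n - j] for j in range(1, n)
        )
    return kappa

p = Fr(1, 3)
alpha = [1, p, p, p, p]            # moments of the assumed 0-1 variable
k = cumulants_from_moments(alpha)

a1, a2, a3, a4 = alpha[1:]
# Compare with the explicit polynomial expressions for the first four orders.
print(k[2] == a2 - a1**2)
print(k[3] == a3 - 3*a1*a2 + 2*a1**3)
print(k[4] == a4 - 3*a2**2 - 4*a1*a3 + 12*a1**2*a2 - 6*a1**4)
```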

15.11. Independent variables. Let $\xi$ and $\eta$ be random variables
with the d.f:s $F_1$ and $F_2$, and the joint pr.f. $P$. By (14.4.5) a
necessary and sufficient condition for the independence of $\xi$ and $\eta$ is
that the joint d.f. of the variables is, for all $x$ and $y$, given by the
expression¹)

(15.11.1)    $F(x, y) = P(\xi \le x, \eta \le y) = F_1(x) F_2(y)$.

When both variables have distributions belonging to the same simple
type, the independence condition may be expressed in a more
convenient form, as we are now going to show.

Consider first the case of two variables of the discrete type, with
distributions given by

    $P(\xi = x_\nu) = p_\nu$,    $P(\eta = y_\nu) = q_\nu$,

where $\nu = 1, 2, \ldots$ It is then easily seen that the independence
condition (15.11.1) is equivalent to

(15.11.2)    $P(\xi = x_\mu, \eta = y_\nu) = p_\mu q_\nu$

for all values of $\mu$ and $\nu$.

¹) Another necessary and sufficient condition will be given in 21.3.

In the case of two variables of the continuous type, the
independence condition (15.11.1) may be differentiated with respect to $x$ and $y$,
and we obtain

(15.11.3)    $f(x, y) = \frac{\partial^2 F}{\partial x\, \partial y} = f_1(x) f_2(y)$,

where $f_1$ and $f_2$ are the fr.f:s of $\xi$ and $\eta$, while $f$ is according to
8.4 the fr.f. of the joint distribution, or the joint fr.f. of $\xi$ and $\eta$.
Conversely, from (15.11.3) we obtain (15.11.1) by direct integration.

Thus a necessary and sufficient condition for independence is given
by (15.11.2) in the case of two discrete variables, and by (15.11.3) in
the case of two continuous variables. Both conditions immediately extend
themselves to an arbitrary finite number of variables.
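
For two variables of the discrete type, the independence condition says that every joint probability equals the product of the corresponding marginal probabilities. The Python sketch below is an added illustration; the two joint distributions and the function name `is_independent` are assumptions made for the example.

```python
from fractions import Fraction as Fr

def is_independent(joint):
    """Check that P(xi = x, eta = y) = P(xi = x) P(eta = y) for every cell.

    joint maps (x, y) -> P(xi = x, eta = y); absent pairs carry no mass.
    """
    xs = sorted({x for x, _ in joint})
    ys = sorted({y for _, y in joint})
    p = {x: sum(joint.get((x, y), 0) for y in ys) for x in xs}   # marginal of xi
    q = {y: sum(joint.get((x, y), 0) for x in xs) for y in ys}   # marginal of eta
    return all(joint.get((x, y), 0) == p[x] * q[y] for x in xs for y in ys)

# A joint distribution built as a product of marginals (assumed values) ...
independent = {(x, y): px * qy
               for x, px in {0: Fr(1, 3), 1: Fr(2, 3)}.items()
               for y, qy in {0: Fr(1, 4), 5: Fr(3, 4)}.items()}
# ... and one where the two variables always coincide.
dependent = {(0, 0): Fr(1, 2), (1, 1): Fr(1, 2)}

print(is_independent(independent), is_independent(dependent))
```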

15.12. Addition of independent variables. Let $\xi$ and $\eta$ be
independent random variables with known distributions. By 14.5, the
sum $\xi + \eta$ has a distribution uniquely determined by the distributions
of $\xi$ and $\eta$. In many problems it is required to express the d.f., the
c.f., the moments etc. of this distribution in terms of the corresponding
functions and quantities of the given distributions of $\xi$ and $\eta$. The
problem may, of course, be generalized to a sum of more than two
independent variables.

We shall first consider the c.f:s. Let $\varphi_1(t)$, $\varphi_2(t)$ and $\varphi(t)$ denote
the c.f:s of $\xi$, $\eta$ and $\xi + \eta$ respectively. We then have, by the
theorem (15.3.4) on the mean value of a product of independent factors,

    $\varphi(t) = E(e^{it(\xi + \eta)}) = E(e^{it\xi} e^{it\eta}) = E(e^{it\xi})\, E(e^{it\eta}) = \varphi_1(t)\, \varphi_2(t)$.

This relation is immediately extended to an arbitrary finite number
of variables. If $\xi_1, \ldots, \xi_n$ are independent variables with the c.f:s
$\varphi_1(t), \ldots, \varphi_n(t)$, the c.f. $\varphi(t)$ of the sum $\xi_1 + \cdots + \xi_n$ is thus given by
the relation

(15.12.1)    $\varphi(t) = \varphi_1(t)\, \varphi_2(t) \cdots \varphi_n(t)$,

so that we have the following important theorem, which expresses a
fundamental property of the c.f:s.

The characteristic function of a sum of independent variables is equal
to the product of the characteristic functions of the terms.
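
The multiplication theorem can be checked numerically for variables of the discrete type. The Python sketch below is an added illustration; the three distributions, the value of t and the function names are assumptions. The distribution of the sum is built by enumerating all combinations of values, each with the product of the corresponding probabilities, and its c.f. is then compared with the product of the three c.f:s.

```python
import cmath
from itertools import product

def cf(dist, t):
    """Characteristic function of a discrete distribution {value: prob}."""
    return sum(p * cmath.exp(1j * t * x) for x, p in dist.items())

def distribution_of_sum(*dists):
    """Distribution of a sum of independent discrete variables."""
    out = {}
    for combo in product(*(d.items() for d in dists)):
        value = sum(x for x, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p          # independence: probabilities multiply
        out[value] = out.get(value, 0.0) + prob
    return out

# Three illustrative distributions (assumed for the example).
d1 = {0: 0.2, 1: 0.8}
d2 = {-1: 0.5, 2: 0.5}
d3 = {0: 0.1, 3: 0.6, 4: 0.3}

total = distribution_of_sum(d1, d2, d3)
t = 1.3
lhs = cf(total, t)
rhs = cf(d1, t) * cf(d2, t) * cf(d3, t)
print(abs(lhs - rhs) < 1e-12)
```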

We now want to express the d.f. of the sum $\xi + \eta$ by means of
the d.f:s $F_1$ and $F_2$ of the terms. This problem will be treated as an
example of the general method (cf. 10.3 and 15.9) of finding a d.f.
with the aid of its c.f. Consider the integral

    $F(x) = \int_{-\infty}^{\infty} F_1(x - z)\, dF_2(z)$.

Since $F_1$ is bounded, this integral has by 7.1 a finite and determined
value for every $x$. Now $F_1(x - z)$ is, for every fixed $z$, a never
decreasing function of $x$ which is everywhere continuous to the right,
and tends to 1 as $x \to +\infty$, and to 0 as $x \to -\infty$. Consider the
difference $F(x + h) - F(x)$, where $h > 0$. It follows from (7.1.4) that
this difference is non-negative, and from (7.3.1) that it tends to zero
with $h$. It further follows from (7.3.1) that $F(x)$ tends to 1 as
$x \to +\infty$, and to 0 as $x \to -\infty$. Thus $F(x)$ is a d.f. The
corresponding c.f.

    $\int_{-\infty}^{\infty} e^{itx}\, dF(x)$

is, by (7.5.6), the limit as $n \to \infty$ of a sum $s_n$ of the form

    $s_n = \sum_{\nu=1}^{n} e^{itx_\nu} [F(x_\nu) - F(x_{\nu-1})]$,

provided that the maximum length of the sub-intervals $(x_{\nu-1}, x_\nu)$ tends
to zero, while $x_0 \to -\infty$ and $x_n \to +\infty$. Introducing here the
integral expression of $F(x)$, we obtain

    $s_n = \int_{-\infty}^{\infty} S_n\, e^{itz}\, dF_2(z)$,

where

    $S_n = \sum_{\nu=1}^{n} e^{itx'_\nu} [F_1(x'_\nu) - F_1(x'_{\nu-1})]$,    $x'_\nu = x_\nu - z$.

As $n \to \infty$, $S_n$ tends for every fixed $z$ to the limit

    $\lim S_n = \int_{-\infty}^{\infty} e^{itx}\, dF_1(x) = \varphi_1(t)$.

Further, $S_n$ is uniformly bounded, since we have

    $|S_n| \le \sum_{\nu=1}^{n} [F_1(x'_\nu) - F_1(x'_{\nu-1})] \le 1$.

According to (7.1.7) it then follows that

    $\lim s_n = \varphi_1(t) \int_{-\infty}^{\infty} e^{itz}\, dF_2(z) = \varphi_1(t)\, \varphi_2(t)$.

Thus the c.f. of $F(x)$ is identical with the c.f. $\varphi(t) = \varphi_1(t)\varphi_2(t)$ of the
sum $\xi + \eta$, so that $F(x)$ is the required d.f. Since the functions $F_1$
and $F_2$ may evidently be interchanged without affecting the proof,
we have established the following theorem:

The distribution function $F(x)$ of the sum of two independent
variables is given by the expression

(15.12.2)    $F(x) = \int_{-\infty}^{\infty} F_1(x - z)\, dF_2(z) = \int_{-\infty}^{\infty} F_2(x - z)\, dF_1(z)$,

where $F_1$ and $F_2$ are the distribution functions of the terms.¹)

When three d.f:s satisfy (15.12.2), we shall say that $F$ is composed
of the components $F_1$ and $F_2$, and we shall use the abbreviation

(15.12.2 a)    $F(x) = F_1(x) * F_2(x) = F_2(x) * F_1(x)$.

By (15.12.1), this symbolical multiplication of the d.f:s corresponds to
a genuine multiplication of the c.f:s.

If the three variables $\xi_1$, $\xi_2$ and $\xi_3$ are independent, an evident
modification of the proof of (15.12.2) shows that the sum $\xi_1 + \xi_2 + \xi_3$
has the d.f. $(F_1 * F_2) * F_3 = F_1 * (F_2 * F_3)$. Obviously this may be
generalized to any number of components, and it is seen that the
operation of composition is commutative and associative. For the sum
$\xi_1 + \cdots + \xi_n$ of $n$ independent variables we have the d.f.

(15.12.3)    $F = F_1 * F_2 * \cdots * F_n$.

¹) The reader should try to construct a direct proof of this theorem, without the
use of characteristic functions. It is to be proved that, in the two-dimensional
distribution of the independent variables $\xi$ and $\eta$, the mass quantity $F(x)$ situated
in the half-plane $\xi + \eta \le x$ is given by (15.12.2). Cf. Cramér, Ref. 11, p. 86.

Let us now consider the following two particular cases of the
composition of two components according to (15.12.2):

a) Both components belong to the discrete type (cf. 15.2).

b) Both components belong to the continuous type, and at least
one of the fr.f:s, say $f_1 = F'_1$, is bounded for all $x$.

In case a), let $x_1, x_2, \ldots$ and $y_1, y_2, \ldots$ denote the discontinuity
points of $F_1$ and $F_2$ respectively. It is then evident that the total
mass of the composed distribution is concentrated in the points
$x_r + y_s$, where $r$ and $s$ independently assume the values 1, 2, ... If
the set of all these points has no finite limiting point, the composed
d.f. thus also belongs to the discrete type. This is the case e.g.
when all the $x_r$ and $y_s$ are non-negative, or when at least one of the
sequences $\{x_r\}$ and $\{y_s\}$ is finite.

In case b), the first integral in (15.12.2) satisfies the conditions for
derivation with respect to $x$ (cf. 7.3.2). Further, by (7.3.1) and (7.5.5),
the derivative $F'(x) = f(x)$ is continuous for all $x$, and may be
expressed as a Riemann integral

(15.12.4)    $f(x) = \int_{-\infty}^{\infty} f_1(x - z) f_2(z)\, dz = \int_{-\infty}^{\infty} f_2(x - z) f_1(z)\, dz$.

Thus the composed distribution belongs to the continuous type, and
the fr.f. $f(x)$ is everywhere continuous.
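
The composition of two frequency functions, i.e. the integral of f_1(x - z) f_2(z) with respect to z, can be approximated by a Riemann sum. The Python sketch below is an added illustration; the choice of two variables uniformly distributed over (0, 1), the integration range and the function names are assumptions. For this choice the composed fr.f. is the well-known triangular one, equal to x on (0, 1) and to 2 - x on (1, 2).

```python
def uniform_density(x):
    """Frequency function of a variable uniformly distributed over (0, 1)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def composed_density(f1, f2, x, lo=-1.0, hi=2.0, steps=3000):
    """Midpoint Riemann sum for the integral of f1(x - z) f2(z) dz.

    The range (lo, hi) must cover the region where f2 is non-zero.
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * h
        total += f1(x - z) * f2(z)
    return total * h

for x in (0.5, 1.0, 1.5):
    print(x, round(composed_density(uniform_density, uniform_density, x), 3))
```

Although each component has a discontinuous fr.f., the composed fr.f. is continuous, in agreement with the statement above.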

Returning to the general case, we denote by $m_1$, $m_2$ and $m$ the
means, and by $\sigma_1$, $\sigma_2$ and $\sigma$ the s.d:s of $\xi$, $\eta$ and $\xi + \eta$ respectively.
Since $\xi$ and $\eta$ are independent, we then have by (15.3.7) and (15.6.1)

(15.12.5)    $m = m_1 + m_2$,    $\sigma^2 = \sigma_1^2 + \sigma_2^2$.

For the higher moments about the mean, a general expression is
deduced from the relation

    $\mu_\nu = E[(\xi + \eta - m)^\nu] = E[(\xi - m_1 + \eta - m_2)^\nu]$.

Since any first order moment about a mean is zero, we have in
particular, using easily understood notations,

             $\mu_3 = \mu_3^{(1)} + \mu_3^{(2)}$,
(15.12.6)    $\mu_4 = \mu_4^{(1)} + 6\mu_2^{(1)}\mu_2^{(2)} + \mu_4^{(2)}$.

The composition formulae for moments are directly extended to
the case of more than two variables. For the addition of $n$
independent variables, we thus have the following simple expressions for the
moments of the three lowest orders:

             $m = m_1 + m_2 + \cdots + m_n$,
(15.12.7)    $\sigma^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_n^2$,
             $\mu_3 = \mu_3^{(1)} + \mu_3^{(2)} + \cdots + \mu_3^{(n)}$.

For the higher moments ($\nu > 3$), the formulae become more complicated.

Finally, we shall consider the semi-invariants of the composed
distribution. The multiplication theorem for characteristic functions
gives us

    $\log \varphi(t) = \log \varphi_1(t) + \log \varphi_2(t)$.

Hence we obtain by (15.10.2) $\kappa_\nu = \kappa_\nu^{(1)} + \kappa_\nu^{(2)}$. This simple composition
rule is the chief reason for introducing the semi-invariants. The
extension to the case of $n$ independent variables is immediate and gives

(15.12.8)    $\kappa_\nu = \kappa_\nu^{(1)} + \kappa_\nu^{(2)} + \cdots + \kappa_\nu^{(n)}$.


CHAPTER 16. 

Various Discrete Distributions. 


16.1. The function $\varepsilon(x)$. The simplest discrete distribution has
the total mass 1 concentrated in one single point, say in the point
$x = 0$. This is the distribution of a variable $\xi$ which is "almost
always" equal to zero, i.e. such that $P(\xi = 0) = 1$. The corresponding
d.f. is the function $\varepsilon(x)$ defined by (6.7.1):

(16.1.1)    $\varepsilon(x) = 0$ for $x < 0$,    $\varepsilon(x) = 1$ for $x \ge 0$.

The c.f. is identically equal to 1, as we have already remarked in 10.1.

More generally, a "variable" which is almost always equal to $x_0$
has the d.f. $\varepsilon(x - x_0)$ and the c.f. $e^{itx_0}$. The mean of this variable
is $x_0$, and the s.d. is zero. Conversely, if it is known that the s.d.
of a certain variable is equal to zero, it follows (cf. 15.6) that the
whole mass of the distribution is concentrated in one single point, so
that the d.f. must be of the form $\varepsilon(x - x_0)$.

The general d.f. of the discrete type as given by (15.2.1) may be
written

(16.1.2)    $F(x) = \sum_{\nu} p_\nu\, \varepsilon(x - x_\nu)$.


Let us consider the particular case of a discrete variable $\xi$ the
distribution of which is specified in the following way:

(16.1.3)    $\xi = 1$ with the probability $p$,    $\xi = 0$ with the probability $q = 1 - p$.

In the following paragraph, we shall make an important use of
variables possessing this distribution. From (16.1.2) we obtain the
d.f. of $\xi$

    $F(x) = p\, \varepsilon(x - 1) + q\, \varepsilon(x)$,

and hence the c.f.

(16.1.4)    $\varphi(t) = p e^{it} + q = 1 + p(e^{it} - 1)$.

The mean and variance of $\xi$ are

(16.1.5)    $E(\xi) = p \cdot 1 + q \cdot 0 = p$,    $D^2(\xi) = E((\xi - p)^2) = p(1 - p)^2 + q(0 - p)^2 = pq$.

16.2. The binomial distribution. Let $\mathfrak{E}$ be a given random
experiment, and denote by $E$ an event having a definite probability $p$
to occur at each performance of $\mathfrak{E}$. Consider a series of $n$ independent
repetitions of $\mathfrak{E}$ (cf. 14.4), and let us define a random variable $\xi_r$
attached to the $r$:th experiment by writing

    $\xi_r = 1$ when $E$ occurs at the $r$:th experiment (probability $= p$),
    $\xi_r = 0$ otherwise (probability $= q = 1 - p$).

Then each $\xi_r$ has the probability distribution (16.1.3) considered in
the preceding paragraph, and the variables $\xi_1, \ldots, \xi_n$ are independent.
Obviously $\xi_r$ denotes the number of occurrences of $E$ in the $r$:th
experiment, so that the sum

    $\nu = \xi_1 + \xi_2 + \cdots + \xi_n$

denotes the total number of occurrences of the event $E$ in our series of
$n$ repetitions of the experiment $\mathfrak{E}$.

Since $\nu$ is a sum of $n$ independent random variables, it is itself
a random variable¹), the distribution of which may be found by the
methods developed in 15.12. Thus we obtain by (15.12.7) and (16.1.5)
the following expressions for the mean, the variance and the s.d. of $\nu$:

(16.2.1)    $E(\nu) = np$,    $D^2(\nu) = npq$,    $D(\nu) = \sqrt{npq}$.


¹) Throughout the general theory developed in the preceding chapters, we have
systematically used the letters $\xi$ and $\eta$ to denote random variables. From now on it
would, however, be inconvenient to adhere strictly to this rule. We shall thus often
find it practical to allow any other letters (Greek or italic) to denote random variables.
It will thus always be necessary to observe with great care the significance of the
various letters used in the formulae.

The ratio $\nu/n$ expresses the frequency of $E$ in our series of $n$
repetitions. For the mean and the s.d. of $\nu/n$, we have

(16.2.2)    $E\left(\frac{\nu}{n}\right) = p$,    $D\left(\frac{\nu}{n}\right) = \sqrt{\frac{pq}{n}}$.

The c.f. of $\nu$ is by (15.12.1) equal to the product of the c.f:s of
all the $\xi_r$, and thus we obtain from (16.1.4)

(16.2.3)    $E(e^{it\nu}) = (p e^{it} + q)^n = (1 + p(e^{it} - 1))^n$.

Developing the first expression by the binomial theorem, we find

    $E(e^{it\nu}) = \sum_{r=0}^{n} \binom{n}{r} p^r q^{n-r} e^{itr}$.

By (10.1.5) this is, however, the c.f. of a variable which may assume
the values $r = 0, 1, \ldots, n$ with the probabilities $P_r = \binom{n}{r} p^r q^{n-r}$.
Owing to the one-to-one correspondence between distributions and
characteristic functions, we may thus conclude (cf. 15.9) that the probability
distribution of $\nu$ is specified by the relation

(16.2.4)    $P(\nu = r) = P_r = \binom{n}{r} p^r q^{n-r}$    $(r = 0, 1, \ldots, n)$.

This is the binomial distribution, the simplest properties of which we
assume to be already known. It is a distribution of the discrete type,
involving two parameters $n$ and $p$, where $n$ is a positive integer,
while $0 < p < 1$. (The cases $p = 0$ and $p = 1$ are trivial and will be
excluded from our discussion.) The corresponding d.f.

(16.2.5)    $B_n(x; p) = P(\nu \le x) = \sum_{r \le x} \binom{n}{r} p^r q^{n-r}$

is a step-function, with steps of the height $P_r$ in the $n + 1$ discrete
mass points $r = 0, 1, \ldots, n$.
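
The binomial probabilities C(n, r) p^r q^(n-r) and the corresponding step-function are easily tabulated. The Python sketch below is an added illustration; the parameter values n = 10, p = 3/10 and the function names are assumptions. Exact fractions are used, so that the total mass, the mean np and the variance npq can be checked without rounding.

```python
from fractions import Fraction as Fr
from math import comb

def binomial_pmf(n, p):
    """Probabilities P_r = C(n, r) p^r q^(n-r) for r = 0, ..., n."""
    q = 1 - p
    return [comb(n, r) * p**r * q ** (n - r) for r in range(n + 1)]

def binomial_df(n, p, x):
    """Distribution function P(nu <= x): sum of P_r over all r <= x."""
    return sum(P for r, P in enumerate(binomial_pmf(n, p)) if r <= x)

n, p = 10, Fr(3, 10)               # illustrative parameter values (assumed)
pmf = binomial_pmf(n, p)
mean = sum(r * P for r, P in enumerate(pmf))
var = sum((r - mean) ** 2 * P for r, P in enumerate(pmf))

print(sum(pmf) == 1, mean == n * p, var == n * p * (1 - p))
print(float(binomial_df(n, p, 3)))  # mass situated in the points 0, 1, 2, 3
```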

In order to find the moments $\mu_r$ about the mean of the binomial distribution, we consider the c. f. of the deviation $\nu - np$. This is

    $e^{-npit}(pe^{it} + q)^n = (pe^{qit} + qe^{-pit})^n = \left[\sum_{r=0}^{\infty} \bigl(pq^r + q(-p)^r\bigr) \frac{(it)^r}{r!}\right]^n.$

Thus all moments $\mu_r$ are finite and may be found by equating coefficients in the relation

    $\sum_{r=0}^{\infty} \mu_r \frac{(it)^r}{r!} = \left[\sum_{r=0}^{\infty} \bigl(pq^r + q(-p)^r\bigr) \frac{(it)^r}{r!}\right]^n.$

In particular, we find

(16.2.6)    $\mu_2 = \sigma^2 = npq, \qquad \mu_3 = npq(q - p), \qquad \mu_4 = 3n^2p^2q^2 + npq(1 - 6pq).$


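For the coefficients of skewness and excess, we thus have the expressions

    $\gamma_1 = \frac{\mu_3}{\sigma^3} = \frac{q - p}{\sqrt{npq}} = \frac{1 - 2p}{\sqrt{npq}}, \qquad \gamma_2 = \frac{\mu_4}{\sigma^4} - 3 = \frac{1 - 6pq}{npq}.$

The skewness is positive for $p < \frac{1}{2}$, negative for $p > \frac{1}{2}$, and zero for $p = \frac{1}{2}$. Both coefficients $\gamma_1$ and $\gamma_2$ tend to zero as $n \to \infty$.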
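The closed forms for the central moments and for the coefficients of skewness and excess can be checked against a direct summation over the binomial probabilities. The sketch below does this in Python; the helper names and the values n = 12, p = 0.3 are illustrative assumptions.

```python
from math import comb

def central_moment(n, p, k):
    """k-th moment about the mean np of a binomial (n, p) variable, by direct summation."""
    q = 1.0 - p
    return sum(comb(n, r) * p**r * q**(n - r) * (r - n * p) ** k for r in range(n + 1))

def closed_forms(n, p):
    """mu_2, mu_3, mu_4 as in (16.2.6), followed by gamma_1 and gamma_2."""
    q = 1.0 - p
    mu2 = n * p * q
    mu3 = n * p * q * (q - p)
    mu4 = 3 * n**2 * p**2 * q**2 + n * p * q * (1 - 6 * p * q)
    return mu2, mu3, mu4, mu3 / mu2**1.5, mu4 / mu2**2 - 3

if __name__ == "__main__":
    n, p = 12, 0.3  # illustrative values
    print([round(central_moment(n, p, k), 6) for k in (2, 3, 4)])
    print([round(v, 6) for v in closed_forms(n, p)])
```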

Let $\nu_1$ and $\nu_2$ denote two independent variables, both having binomial distributions with the same value of the parameter $p$, and with the values $n_1$ and $n_2$ of the parameter $n$. We may, e. g., take $\nu_1$ and $\nu_2$ equal to the number of occurrences of the event $E$ in two independent series of $n_1$ and $n_2$ repetitions of the experiment $\mathfrak{E}$.

The sum $\nu_1 + \nu_2$ is then equal to the number of occurrences of $E$ in a series of $n_1 + n_2$ repetitions. Accordingly the c. f. of $\nu_1 + \nu_2$ is (cf 15.12)

    $E(e^{it(\nu_1 + \nu_2)}) = (pe^{it} + q)^{n_1}(pe^{it} + q)^{n_2} = (pe^{it} + q)^{n_1 + n_2}.$

This is the c. f. of a binomial distribution with the parameters $p$ and $n_1 + n_2$. Thus the addition of two independent variables with the d. f:s $B_{n_1}(x; p)$ and $B_{n_2}(x; p)$ gives (as may, of course, also be directly perceived) a variable with the d. f. $B_{n_1+n_2}(x; p)$. In the abbreviated notation of (15.12.2 a) this may be written

    $B_{n_1}(x; p) * B_{n_2}(x; p) = B_{n_1+n_2}(x; p).$

Thus the binomial distribution reproduces itself by addition of independent variables. We shall call this an addition theorem for the binomial distribution. Later, we shall see that similar (but less evident) addition theorems hold also for certain other important distributions.
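The addition theorem for the binomial distribution can be verified numerically by composing two binomial distributions term by term. The following Python sketch is only an illustration; the function names and the parameter values are assumptions.

```python
from math import comb

def binomial_pmf(n, p):
    """Probabilities P_0, ..., P_n of a binomial (n, p) variable."""
    q = 1.0 - p
    return [comb(n, r) * p**r * q**(n - r) for r in range(n + 1)]

def convolve(a, b):
    """Distribution of the sum of two independent non-negative integer variables."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def max_abs_diff(u, v):
    """Largest absolute difference between corresponding terms."""
    return max(abs(x - y) for x, y in zip(u, v))

if __name__ == "__main__":
    n1, n2, p = 4, 7, 0.3  # illustrative values
    composed = convolve(binomial_pmf(n1, p), binomial_pmf(n2, p))
    print(max_abs_diff(composed, binomial_pmf(n1 + n2, p)))  # rounding error only
```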


16.3. Bernoulli's theorem. - For the frequency ratio $\nu/n$ considered in the preceding paragraph, we have by (16.2.2)

    $E\left(\frac{\nu}{n}\right) = p, \qquad D\left(\frac{\nu}{n}\right) = \sqrt{\frac{pq}{n}}.$

We now apply the Bienaymé-Tchebycheff inequality (15.7.2), taking $k = \varepsilon\sqrt{\frac{n}{pq}}$, where $\varepsilon$ denotes a given positive quantity. Denoting by $P$ the probability function of the variable $\nu$, we then obtain the following result:

(16.3.1)    $P\left(\left|\frac{\nu}{n} - p\right| \ge \varepsilon\right) \le \frac{pq}{n\varepsilon^2} \le \frac{1}{4n\varepsilon^2}.$

If $\delta$ denotes another given positive quantity, it follows that, as soon as we take $n > \frac{1}{4\delta\varepsilon^2}$, the probability on the left hand side of (16.3.1) becomes smaller than $\delta$. Since $\delta$ is arbitrarily small, we have proved the following theorem.

The probability that the frequency $\nu/n$ differs from its mean value $p$ by a quantity of modulus at least equal to $\varepsilon$ tends to zero as $n \to \infty$, however small $\varepsilon > 0$ is chosen.
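To see how conservative the inequality (16.3.1) is, one may compare the tail probability, summed directly from the binomial terms, with the two bounds. The Python sketch below does this for a few values of n; the names and the values p = 0.3, ε = 0.05 are illustrative assumptions, and the values of n are chosen so that no mass point falls exactly on the boundary of the event.

```python
from math import comb

def tail_probability(n, p, eps):
    """P(|nu/n - p| >= eps) for a binomial (n, p) variable, summed term by term."""
    q = 1.0 - p
    return sum(comb(n, r) * p**r * q**(n - r)
               for r in range(n + 1) if abs(r / n - p) >= eps)

def chebyshev_bound(n, p, eps):
    """Middle member pq / (n eps^2) of (16.3.1)."""
    return p * (1.0 - p) / (n * eps**2)

if __name__ == "__main__":
    p, eps = 0.3, 0.05  # illustrative values
    for n in (50, 250, 750):
        print(n, tail_probability(n, p, eps),
              chebyshev_bound(n, p, eps), 1.0 / (4 * n * eps**2))
```

The bound is far from sharp, but it suffices for the limit statement of the theorem.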

This is, in modern terminology, the classical Bernoulli theorem, originally proved by James Bernoulli, in his posthumous work Ars Conjectandi (1713), in a quite different way. Bernoulli considered the two complementary probabilities

    $\varpi = P\left(\left|\frac{\nu}{n} - p\right| \ge \varepsilon\right), \qquad 1 - \varpi = P\left(\left|\frac{\nu}{n} - p\right| < \varepsilon\right),$

and proved by a direct evaluation of the terms of the binomial expansion that, for any given $\varepsilon > 0$, the ratio $\frac{1 - \varpi}{\varpi}$ may be made to exceed any given quantity by choosing $n$ sufficiently large.


The variable $\nu$ is, according to the preceding paragraph, attached to a combined experiment, consisting in a series of $n$ repetitions of the original experiment $\mathfrak{E}$. Thus by 13.5 any probability statement with respect to $\nu$ is a statement concerning the approximate value of the frequency of some specified event in a series of repetitions of the combined experiment. The frequency interpretation (cf 13.5) of any such probability statement thus always refers to a series of repetitions of the combined experiment.

Consider e. g. the frequency interpretation of the probability $\varpi$ defined above. We begin by making a series of $n$ repetitions of the experiment $\mathfrak{E}$, and noting the number $\nu$ of occurrences of the event $E$. This is our first performance of the combined experiment. If the observed number $\nu$ satisfies the relation $\left|\frac{\nu}{n} - p\right| \ge \varepsilon$, we say that the event $E'$ occurs in the first combined experiment. The event $E'$ has then the probability $\varpi$.

We then repeat the whole series of $n$ experiments a large number $n'$ of times, so that we finally obtain a series of $n'$ repetitions of the combined experiment. The total number of performances of $\mathfrak{E}$ required will then, of course, be $n'n$. Let $\nu'$ denote the number of occurrences of $E'$ in the whole series of $n'$ repetitions of the combined experiment. The frequency interpretation of the probability $\varpi$ then consists in the following statement: For large values of $n'$, it is practically certain that the frequency $\nu'/n'$ will be approximately equal to $\varpi$.

Now the Bernoulli theorem as expressed by (16.3.1) shows that, as soon as we take $n > \frac{1}{4\delta\varepsilon^2}$, we have $\varpi < \delta$, where $\delta$ is given and arbitrarily small. In a long series of repetitions of the combined experiment (i. e. for large $n'$), we should then expect the event $\left|\frac{\nu}{n} - p\right| \ge \varepsilon$ to occur with a frequency smaller than $\delta$. Choosing for $\delta$ some very small number, and making one single performance of the combined experiment, i. e. one single series of $n$ repetitions of the experiment $\mathfrak{E}$, we may then (cf 13.5) consider it as practically certain that the event $\left|\frac{\nu}{n} - p\right| \ge \varepsilon$ will not occur.

What value of $\delta$ we should choose in order to realize a satisfactory degree of "practical certainty" depends on the risk that we are willing to run with respect to a failure of our predictions. Suppose, however, that we have agreed to consider a certain value $\delta_0$ as sufficiently small for our purpose. Returning to the original event $E$ with the probability $p$, we may then give the following more precise statement of the frequency interpretation of this probability, as given in 13.5:

Let $\varepsilon > 0$ be given. If we choose $n > \frac{1}{4\delta_0\varepsilon^2}$, it is practically certain that, in one single series of $n$ repetitions of the experiment $\mathfrak{E}$, we shall have $\left|\frac{\nu}{n} - p\right| < \varepsilon$.

This statement may be called the frequency interpretation of the Bernoulli theorem. Like all frequency interpretations, this is not a mathematical theorem, but a statement concerning certain observable facts, which must hold true if the mathematical theory is to be of any practical value.

16.4. De Moivre's theorem. - The random variable

(16.4.1)    $\nu = \xi_1 + \xi_2 + \cdots + \xi_n$

considered in the two preceding paragraphs has, by (16.2.1), the mean $np$ and the standard deviation $\sqrt{npq}$. The standardized variable (cf 15.6)

(16.4.2)    $\lambda = \frac{\nu - np}{\sqrt{npq}}$

thus has the mean 0 and the s. d. 1. The transformation by which we pass from $\nu$ to $\lambda$ consists, of course, only in a change of origin and scale of the variable. The ordinates in the diagram of the probability distribution have the same values for both variables. We have, in fact, using the same notations as in the preceding paragraphs,

    $P\left(\lambda = \frac{r - np}{\sqrt{npq}}\right) = P(\nu = r) = P_r = \binom{n}{r} p^r q^{n-r}$

for $r = 0, 1, \ldots, n$.

The d. f. and the c. f. of the variable $\nu$ are given by (16.2.5) and (16.2.3). Denoting by $F_n(x)$ and $\varphi_n(t)$ the corresponding functions of the standardized variable $\lambda$, we obtain (cf 15.9)

(16.4.3)    $F_n(x) = B_n(np + x\sqrt{npq};\, p), \qquad \varphi_n(t) = \left(pe^{\frac{qit}{\sqrt{npq}}} + qe^{-\frac{pit}{\sqrt{npq}}}\right)^n.$

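We shall now consider the behaviour of the probability distribution of $\lambda$ for increasing values of $n$, when $p$ has a fixed value. We begin by making a transformation of the above expression for the c. f. $\varphi_n(t)$. For any integer $k > 0$ and for any real $z$ we have the MacLaurin expansion

(16.4.4)    $e^{iz} = \sum_{j=0}^{k-1} \frac{(iz)^j}{j!} + \vartheta\frac{z^k}{k!},$

where we use $\vartheta$ as a general symbol for a real or complex quantity of modulus not exceeding unity. Using this development with $k = 3$, we obtain

    $pe^{\frac{qit}{\sqrt{npq}}} = p + \frac{pqit}{\sqrt{npq}} - \frac{pq^2t^2}{2npq} + \vartheta\frac{pq^3t^3}{3!\,(npq)^{3/2}},$

    $qe^{-\frac{pit}{\sqrt{npq}}} = q - \frac{pqit}{\sqrt{npq}} - \frac{p^2qt^2}{2npq} + \vartheta\frac{p^3qt^3}{3!\,(npq)^{3/2}},$

and hence, introducing in (16.4.3),

    $\varphi_n(t) = \left(1 - \frac{t^2}{2n} + \vartheta\frac{(p^2 + q^2)t^3}{3!\,n\sqrt{npq}}\right)^n.$

Writing

    $y = -\frac{t^2}{2} + \vartheta\frac{(p^2 + q^2)t^3}{3!\sqrt{npq}},$

this gives us

    $\log\varphi_n(t) = n\log\left(1 + \frac{y}{n}\right) = y\cdot\frac{n}{y}\log\left(1 + \frac{y}{n}\right).$

Now as $n$ tends to infinity while $t$ remains fixed, it is obvious that $y$ tends to $-\frac{t^2}{2}$. Hence $\frac{y}{n}$ tends to zero, and $\frac{n}{y}\log\left(1 + \frac{y}{n}\right)$ tends to unity. It then follows that $\log\varphi_n(t)$ tends to $-\frac{t^2}{2}$ and finally that

    $\varphi_n(t) \to e^{-\frac{t^2}{2}}$

for every $t$.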

We are now in a position to apply the continuity theorem 10.4 for c. f:s. We have just proved that the sequence of c. f:s $\{\varphi_n(t)\}$ defined by (16.4.3) converges, for every $t$, to the limit $e^{-t^2/2}$, which is continuous for all $t$. By the continuity theorem we then infer 1) that the limit $e^{-t^2/2}$ is itself the c. f. of a certain d. f., and 2) that the sequence of d. f:s $\{F_n(x)\}$ defined by (16.4.3) converges to the d. f. which corresponds to the c. f. $e^{-t^2/2}$.

Now we have by (10.5.3) and (10.5.4)

    $e^{-\frac{t^2}{2}} = \int_{-\infty}^{\infty} e^{itx}\,d\Phi(x),$

where

    $\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-\frac{t^2}{2}}\,dt,$

so that $e^{-t^2/2}$ is the c. f. of the d. f. $\Phi(x)$ given by the last expression. This is the important normal distribution function that will be separately treated in the following chapter. For our present purpose we only observe that $\Phi(x)$ is continuous for every $x$. We have thus proved the following limit theorem for the binomial distribution first obtained by De Moivre in 1733:

For every fixed $x$ and $p$, we have

(16.4.5)    $\lim_{n\to\infty} B_n(np + x\sqrt{npq};\, p) = \Phi(x).$


Thus the binomial distribution of the variable $\nu = \xi_1 + \cdots + \xi_n$, appropriately standardized by the mean and the s. d. according to (16.4.2), tends to the normal distribution as $n$ tends to infinity. We shall see later (cf 17.4) that this is only a particular case of a very general and important theorem concerning the distribution of the sum of a large number of independent random variables. The method of proof used above has been chosen with a view to prepare the reader for the proof of this general theorem. In the present particular case of the binomial distribution it is, however, possible to reach the same result also by a more direct method, without the use of characteristic functions. This is the method usually found in text-books, and we shall here content ourselves with some brief indications on the subject, referring for further detail to some standard treatise on probability theory.
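The convergence asserted by De Moivre's theorem can be observed numerically by measuring the largest distance between the standardized binomial d. f. and the normal d. f. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, the normal d. f. is written with the error function of the standard library, and p = 0.3 with n = 5, 30, 500 are arbitrary choices.

```python
from itertools import accumulate
from math import comb, erf, sqrt

def Phi(x):
    """Normal distribution function, expressed through the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def sup_distance(n, p):
    """sup over x of |B_n(np + x sqrt(npq); p) - Phi(x)|.

    The binomial d. f. is a step function and Phi is increasing, so it is
    enough to compare the two at each jump, just before it and at it.
    """
    q = 1.0 - p
    pmf = [comb(n, r) * p**r * q**(n - r) for r in range(n + 1)]
    cum = [0.0] + list(accumulate(pmf))  # cum[r] = P(nu < r), cum[r + 1] = P(nu <= r)
    s = sqrt(n * p * q)
    worst = 0.0
    for r in range(n + 1):
        target = Phi((r - n * p) / s)
        worst = max(worst, abs(cum[r] - target), abs(cum[r + 1] - target))
    return worst

if __name__ == "__main__":
    for n in (5, 30, 500):  # illustrative values
        print(n, round(sup_distance(n, 0.3), 4))
```

The distance shrinks roughly like the height of the largest step, which is of the order $1/\sqrt{npq}$.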

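The relation (16.4.5) is equivalent to

(16.4.6)    $\lim_{n\to\infty} \sum_{np + \lambda_1\sqrt{npq} \,<\, \nu \,\le\, np + \lambda_2\sqrt{npq}} \binom{n}{\nu} p^\nu q^{n-\nu} = \frac{1}{\sqrt{2\pi}}\int_{\lambda_1}^{\lambda_2} e^{-\frac{t^2}{2}}\,dt$

for any fixed interval $(\lambda_1, \lambda_2)$. Now (16.4.6) may be proved by means of a direct evaluation of the terms in the binomial expansion. For this purpose, we express the factorials in the binomial coefficient appearing in (16.4.6) by means of the Stirling formula (12.5.3). We then obtain after some calculations the expression

(16.4.7)    $\binom{n}{\nu} p^\nu q^{n-\nu} = \frac{1}{\sqrt{2\pi npq}}\, e^{-\frac{(\nu - np)^2}{2npq}} + \vartheta\frac{C}{n},$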





where $C$ is a quantity depending on $p$, but not on $\nu$ or $n$, while $\vartheta$ has the same significance as before. The first member of (16.4.6) is thus equal to

    $\frac{1}{\sqrt{2\pi npq}} \sum e^{-\frac{(\nu - np)^2}{2npq}} + \vartheta\frac{C}{n}\bigl((\lambda_2 - \lambda_1)\sqrt{npq} + 1\bigr),$

the sum being extended over the same values of $\nu$ as in (16.4.6). As $n \to \infty$, the second term in this expression tends to zero, while the first term is a Darboux sum approximating the integral in the second member of (16.4.6) and tending to this integral as its limit. Thus (16.4.6) is proved.

Fig. 8. Distribution function of $\nu$ (or $\lambda$) and normal distribution function. $p = 0.8$, $n = 6$.

Fig. 9. Distribution function of $\nu$ (or $\lambda$) and normal distribution function. $p = 0.8$, $n = 30$.

Fig. 11. $\sqrt{npq}\,\binom{n}{\nu} p^\nu q^{n-\nu}$ and normal frequency function. $p = 0.8$, $n = 30$.

For the graphical illustration of the limit theorem (16.4.5), we may in the first place have recourse to a direct comparison between the graphs of the distribution functions $B_n$ and $\Phi$, as shown in some cases by Figs. 8-9. We may, however, also use the relation (16.4.7). If we allow here $\nu$ to tend to infinity with $n$, in such a way that $\frac{\nu - np}{\sqrt{npq}}$ tends to a finite limit $x$, we obtain

    $\sqrt{npq}\,\binom{n}{\nu} p^\nu q^{n-\nu} \to \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}.$

If the scale of $\nu$ is transformed by choosing the mean $np$ as origin and the s. d. $\sqrt{npq}$ as unit, and if at the same time every probability $P_\nu$ is multiplied by $\sqrt{npq}$, the upper end-points of the corresponding ordinates will thus approach the frequency-curve $y = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}$ of the normal distribution, as $n \to \infty$. This is illustrated by Figs. 10-11.

16.5. The Poisson distribution. - In the preceding paragraph, we have seen that the discrete binomial distribution may, by a limit passage, be transformed into a new distribution of the continuous type, viz. the normal distribution.

By an appropriate modification of the limit passage, we may also obtain a limiting distribution of the discrete type. Suppose that, in the binomial distribution, we allow the probability $p$ to depend on $n$ in such a way that $p$ tends to zero when $n$ tends to infinity. More precisely, we shall suppose that

(16.5.1)    $p = \frac{\lambda}{n},$

where $\lambda$ is a positive constant. For the probability $P_r$ given by (16.2.4) we then obtain, as $n \to \infty$,

    $P_r = \frac{n(n-1)\cdots(n-r+1)}{r!}\left(\frac{\lambda}{n}\right)^r\left(1 - \frac{\lambda}{n}\right)^{n-r} \to \frac{\lambda^r}{r!}\, e^{-\lambda}$

for every fixed $r = 0, 1, 2, \ldots$. The sum of all the limiting values is unity, since we have

    $\sum_{r=0}^{\infty} \frac{\lambda^r}{r!}\, e^{-\lambda} = 1.$

If the probability distribution of a random variable $\xi$ is specified by

(16.5.2)    $P(\xi = r) = \frac{\lambda^r}{r!}\, e^{-\lambda}$ for $r = 0, 1, 2, \ldots,$

$\xi$ is said to possess a Poisson distribution. This is a discrete distribution with one parameter $\lambda$, which is always positive. All points $r = 0, 1, 2, \ldots$ are discrete mass points. Two cases of the distribution are illustrated by Figs. 12-13.

Fig. 13. Poisson distribution, $\lambda = 3.5$.

The c. f. of the Poisson distribution is

(16.5.3)    $E(e^{it\xi}) = \sum_{r=0}^{\infty} \frac{\lambda^r}{r!}\, e^{-\lambda} e^{itr} = e^{\lambda(e^{it} - 1)}.$

According to (15.10.2), this shows that the semi-invariants of the distribution are all finite and equal to $\lambda$. From the two first semi-invariants, we find the mean and the s. d. of the Poisson distribution

    $E(\xi) = \lambda, \qquad D(\xi) = \sqrt{\lambda}.$

Writing $\frac{\lambda}{n}$ for $p$ in the second expression (16.2.3) of the c. f. of the binomial distribution, and allowing $n$ to tend to infinity, it is readily seen that this function tends to the c. f. (16.5.3) of the Poisson distribution. By the continuity theorem 10.4, it then follows that the binomial distribution tends to the Poisson distribution, which confirms the result already obtained by direct study of the probability $P_r$.

It is also easily shown that the condition (16.5.1) can be replaced by the more general condition $np \to \lambda$, without modifying the result.
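The limit passage from the binomial to the Poisson distribution can be followed numerically by taking $p = \lambda/n$ and comparing the two sets of probabilities for increasing n. The Python sketch below is only an illustration; the function names, the cut-off `r_max` and the value λ = 3.5 are assumptions.

```python
from math import comb, exp, factorial

def binomial_term(n, p, r):
    """P_r = C(n, r) p^r (1 - p)^(n - r), cf. (16.2.4)."""
    return comb(n, r) * p**r * (1.0 - p) ** (n - r)

def poisson_term(lam, r):
    """lam^r e^(-lam) / r!, cf. (16.5.2)."""
    return lam**r / factorial(r) * exp(-lam)

def max_gap(n, lam, r_max=20):
    """Largest |P_r - lam^r e^(-lam) / r!| over r = 0, ..., r_max, with p = lam / n."""
    return max(abs(binomial_term(n, lam / n, r) - poisson_term(lam, r))
               for r in range(r_max + 1))

if __name__ == "__main__":
    for n in (25, 100, 10000):  # illustrative values
        print(n, max_gap(n, 3.5))
```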

Finally, if $\xi_1$ and $\xi_2$ are independent Poisson-distributed variables, with the parameters $\lambda_1$ and $\lambda_2$, the sum $\xi_1 + \xi_2$ has the c. f.

    $e^{\lambda_1(e^{it} - 1)} \cdot e^{\lambda_2(e^{it} - 1)} = e^{(\lambda_1 + \lambda_2)(e^{it} - 1)}.$

This is the c. f. of a Poisson distribution with the parameter $\lambda_1 + \lambda_2$. Thus the sum $\xi_1 + \xi_2$ has a Poisson distribution with the parameter $\lambda_1 + \lambda_2$, and we see that the Poisson distribution, like the binomial, has the property of reproducing itself by addition of independent variables. Denoting by $F(x; \lambda)$ the d. f. of the Poisson distribution, the addition theorem for this distribution is expressed by the relation

(16.5.4)    $F(x; \lambda_1) * F(x; \lambda_2) = F(x; \lambda_1 + \lambda_2).$
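The addition theorem (16.5.4) can be checked term by term: composing the probabilities of two Poisson distributions reproduces the probabilities of the Poisson distribution with the summed parameter. In the Python sketch below the names, the truncation point and the parameter values 0.8 and 3.5 are illustrative assumptions.

```python
from math import exp, factorial

def poisson_pmf(lam, r_max):
    """[P(xi = 0), ..., P(xi = r_max)] for a Poisson variable, cf. (16.5.2)."""
    return [lam**r / factorial(r) * exp(-lam) for r in range(r_max + 1)]

def convolve_truncated(a, b):
    """Terms 0, ..., len(a) - 1 of the composition of a and b (lists of equal length).

    Term r only involves indices up to r, so truncating the lists does not
    affect the terms that are returned.
    """
    return [sum(a[i] * b[r - i] for i in range(r + 1)) for r in range(len(a))]

if __name__ == "__main__":
    lam1, lam2, r_max = 0.8, 3.5, 15  # illustrative values
    composed = convolve_truncated(poisson_pmf(lam1, r_max), poisson_pmf(lam2, r_max))
    direct = poisson_pmf(lam1 + lam2, r_max)
    print(max(abs(x - y) for x, y in zip(composed, direct)))  # rounding error only
```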

In statistical applications, the Poisson distribution often appears when we are concerned with the number of occurrences of a certain event in a very large number of observations, the probability for the event to occur in each observation being very small. Examples are the annual number of suicides in a human population, the number of yeast cells in a small sample from a large quantity of suspension, etc. Cf e. g. Bortkiewicz, Ref. 63 a.

In an important group of applications, the fundamental random experiment consists in observing the number of occurrences of a certain event during a time interval of duration $t$, where the choice of $t$ is at our liberty. This situation occurs e. g. in problems of telephone traffic, where we are concerned with the number of telephone calls during time intervals of various durations. Suppose that, in such a case, the numbers of occurrences during non-overlapping time intervals are always independent. Suppose further that the probability that exactly one event occurs in an interval of duration $\Delta t$ is, for small $\Delta t$, equal to

    $\lambda\,\Delta t + o(\Delta t),$

where $\lambda$ is a constant, while the corresponding probability for the occurrence of more than one event is $o(\Delta t)$. Dividing a time interval of duration $t$ in $n$ equal parts, we may consider the $n$ parts as representing $n$ repetitions of a random experiment, where the probability for the event to occur in each instance is

    $\frac{\lambda t}{n} + o\left(\frac{1}{n}\right).$

Allowing $n$ to tend to infinity, we find that the total number of events occurring during the time $t$ will be distributed in a Poisson distribution with the parameter $\lambda t$. Variables of this type are, besides the number of telephone calls already mentioned, the number of disintegrated radioactive atoms, the number of claims in an insurance company, etc.

16.6. The generalized binomial distribution of Poisson. - Suppose that $\mathfrak{E}_1, \ldots, \mathfrak{E}_n$ are $n$ random experiments, such that the random variables attached to the experiments are independent. With each experiment $\mathfrak{E}_r$, we associate an event $E_r$ having the probability $p_r = 1 - q_r$ to occur in a performance of $\mathfrak{E}_r$.

Let us make one performance of each experiment $\mathfrak{E}_1, \ldots, \mathfrak{E}_n$, and note in each case whether the associated event occurs or not. We shall call this a series of independent trials. If, in the experiment $\mathfrak{E}_r$, the associated event $E_r$ occurs, we shall say that the $r$:th trial is a success; in the opposite case we have a failure. Let $\nu$ be the total number of successes in all $n$ trials. What is the probability distribution of $\nu$?

In the particular case when all the experiments $\mathfrak{E}_r$ and all the events $E_r$ are identical, $\nu$ reduces to the variable considered in 16.2, and the required distribution is the binomial distribution. The general case was considered by Poisson (Ref. 32).

In the same way as in 16.2, we define a variable $\xi_r$ attached to the $r$:th trial, and taking the value 1 for a success (probability $p_r$), and 0 for a failure (probability $q_r = 1 - p_r$). The variables $\xi_1, \ldots, \xi_n$ are independent, and each has a distribution of the form (16.1.3). As in the previous case, the total number of successes is $\nu = \xi_1 + \xi_2 + \cdots + \xi_n$.

The c. f. of the random variable $\nu$ is the product of the c. f:s of all the $\xi_r$:

    $E(e^{it\nu}) = \prod_{r=1}^{n} (p_r e^{it} + q_r).$

The possible values for $\nu$ are $\nu = 0, 1, \ldots, n$, and the probability that $\nu$ takes any particular value $r$ is equal to the coefficient of $e^{itr}$ in the development of the product.
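The coefficients in the development of the product can be obtained by multiplying in one factor at a time. The Python sketch below follows that idea; the function name and the probabilities used are illustrative assumptions.

```python
def generalized_binomial_pmf(ps):
    """Coefficients of e^{itr} in prod_r (p_r e^{it} + q_r), i.e. P(nu = r) for r = 0, ..., n."""
    coeffs = [1.0]  # empty product: zero trials, zero successes with probability 1
    for p in ps:
        q = 1.0 - p
        nxt = [0.0] * (len(coeffs) + 1)
        for r, c in enumerate(coeffs):
            nxt[r] += c * q      # failure: number of successes unchanged
            nxt[r + 1] += c * p  # success: one more success
        coeffs = nxt
    return coeffs

if __name__ == "__main__":
    ps = [0.1, 0.5, 0.9, 0.3]  # illustrative probabilities p_1, ..., p_4
    pmf = generalized_binomial_pmf(ps)
    print([round(P, 4) for P in pmf])
    print(sum(r * P for r, P in enumerate(pmf)), sum(ps))  # mean of nu, sum of the p_r
```

When all the $p_r$ are equal, the result reduces to the binomial probabilities of 16.2.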

For the mean value and the variance of $\nu$ we have the expressions

(16.6.1)    $E(\nu) = \sum_1^n E(\xi_r) = \sum_1^n p_r, \qquad D^2(\nu) = \sum_1^n D^2(\xi_r) = \sum_1^n p_r q_r.$

Denoting by $P$ the probability function of $\nu$, and writing $p$ for the arithmetic mean $\frac{1}{n}\sum_1^n p_r$, an application of the Bienaymé-Tchebycheff inequality (15.7.2) now gives the result analogous to (16.3.1)

(16.6.2)    $P\left(\left|\frac{\nu}{n} - p\right| \ge \varepsilon\right) \le \frac{\sum_1^n p_r q_r}{n^2\varepsilon^2} \le \frac{1}{4n\varepsilon^2}.$

We thus have the following generalization of Bernoulli's theorem found by Poisson:

The probability that the frequency of successes $\nu/n$ differs from the arithmetic mean of the probabilities $p_r$ by a quantity of modulus at least equal to $\varepsilon$ tends to zero as $n \to \infty$, however small $\varepsilon > 0$ is chosen.


The frequency interpretation of the generalized theorem is quite similar to the one given in 16.3 for the Bernoulli theorem. Consider in particular the case when all the probabilities $p_r$ are equal to $p$. We then see that in a long series of independent trials, where the probability of a success is constantly equal to $p$, though all trials may be different experiments, it is practically certain that the frequency of successes will be approximately equal to $p$.

There is also a generalization of De Moivre's theorem (16.4.5) to the present case. This will, however, not be proved here, but will be deduced later as a particular case of a still more general theorem to be proved in 17.4.

For the variance of $\nu$, we have found the value $D^2(\nu) = \sum p_r q_r$. In a series of $n$ trials with the constant probability $p = \frac{1}{n}\sum p_r$, the corresponding variance is $npq$, where $q = 1 - p = \frac{1}{n}\sum q_r$. In order to compare the two variances we write

    $\sum p_r q_r = \sum (p + p_r - p)(q + p - p_r) = npq - \sum (p_r - p)^2.$

Thus the "Poisson variance" $\sum p_r q_r$ is always smaller than the corresponding "Bernoulli variance" $npq$. At first sight, this result may seem a little surprising. It becomes more natural if we consider the extreme case when all the probabilities are equal to 0 or 1, both values being represented. The Poisson variance is then equal to zero, while the Bernoulli variance is necessarily positive.
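The comparison of the two variances is easy to reproduce numerically. The following Python sketch evaluates both sides of the identity for an arbitrary set of probabilities and for the extreme case mentioned above; the function names and the numbers are illustrative assumptions.

```python
def poisson_variance(ps):
    """Sum of p_r q_r over all trials."""
    return sum(p * (1.0 - p) for p in ps)

def bernoulli_variance(ps):
    """npq, with p the arithmetic mean of the p_r and q = 1 - p."""
    n = len(ps)
    p = sum(ps) / n
    return n * p * (1.0 - p)

def spread(ps):
    """Sum of (p_r - p)^2, the amount by which the two variances differ."""
    p = sum(ps) / len(ps)
    return sum((pr - p) ** 2 for pr in ps)

if __name__ == "__main__":
    ps = [0.1, 0.5, 0.9, 0.3]  # illustrative probabilities
    print(poisson_variance(ps), bernoulli_variance(ps) - spread(ps))
    extreme = [0.0, 1.0, 1.0, 0.0]  # all probabilities equal to 0 or 1
    print(poisson_variance(extreme), bernoulli_variance(extreme))
```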




CHAPTER 17. 


The Normal Distribution. 

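17.1. The normal functions. - The normal distribution function, which has already appeared in 10.5 and 16.4, is defined by the relation

    $\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-\frac{t^2}{2}}\,dt.$

The corresponding normal frequency function is

    $\Phi'(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}.$

Diagrams of these functions are given in Figs. 14-15, and some numerical values are found in Table 1, p. 557.

The mean value of the distribution is 0, and the s. d. is 1, as shown by (10.5.1):

(17.1.1)    $\int_{-\infty}^{\infty} x\,d\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x e^{-\frac{x^2}{2}}\,dx = 0, \qquad \int_{-\infty}^{\infty} x^2\,d\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x^2 e^{-\frac{x^2}{2}}\,dx = 1.$

Generally, all moments of odd order vanish, while the moments of even order are according to (10.5.1)

(17.1.2)    $\int_{-\infty}^{\infty} x^{2\nu}\,d\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x^{2\nu} e^{-\frac{x^2}{2}}\,dx = 1\cdot 3\cdots(2\nu - 1).$

Finally, the c. f. is by (10.5.4)

(17.1.3)    $\int_{-\infty}^{\infty} e^{itx}\,d\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{itx - \frac{x^2}{2}}\,dx = e^{-\frac{t^2}{2}}.$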
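The moment relations (17.1.1) and (17.1.2) can be confirmed by a simple numerical integration of the normal frequency function. The Python sketch below uses a midpoint rule on a wide finite interval; the function names, the interval and the step count are assumptions made for the illustration.

```python
from math import erf, exp, pi, sqrt

def Phi(x):
    """Normal distribution function, expressed through the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def Phi_prime(x):
    """Normal frequency function."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def moment(k, half_width=12.0, steps=4800):
    """Midpoint-rule approximation of the integral of x^k Phi'(x) over the real line."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * h
        total += x**k * Phi_prime(x)
    return total * h

def odd_product(nu):
    """1 * 3 * ... * (2 nu - 1), the right member of (17.1.2)."""
    out = 1
    for j in range(1, 2 * nu, 2):
        out *= j
    return out

if __name__ == "__main__":
    for nu in (1, 2, 3):
        print(2 * nu, moment(2 * nu), odd_product(nu))
    print(moment(1), moment(3))  # odd moments, zero up to rounding
```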



Fig. 15. The normal frequency function $\Phi'(x)$.


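17.2. The normal distribution. - A random variable $\xi$ will be said to be normally distributed with the parameters $m$ and $\sigma$, or briefly normal $(m, \sigma)$, if the d. f. of $\xi$ is $\Phi\left(\frac{x - m}{\sigma}\right)$, where $\sigma > 0$ and $m$ are constants. The fr. f. is then

    $\frac{1}{\sigma}\Phi'\left(\frac{x - m}{\sigma}\right) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - m)^2}{2\sigma^2}},$

and we obtain from (17.1.1)

    $E(\xi) = \frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{\infty} x e^{-\frac{(x - m)^2}{2\sigma^2}}\,dx = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} (m + \sigma x) e^{-\frac{x^2}{2}}\,dx = m,$

    $D^2(\xi) = \frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{\infty} (x - m)^2 e^{-\frac{(x - m)^2}{2\sigma^2}}\,dx = \frac{\sigma^2}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x^2 e^{-\frac{x^2}{2}}\,dx = \sigma^2,$

so that $m$ and $\sigma$ denote as usual the mean and the s. d. of the variable. The frequency curve

    $y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - m)^2}{2\sigma^2}}$

is symmetric and unimodal (cf 15.5), and reaches its maximum at the point $x = m$, so that $m$ is simultaneously mean, median and mode of the distribution. For $x = m \pm \sigma$, the curve has two inflexion points. A change in the value of $m$ causes only a displacement of the curve, without modifying its form, whereas a change in the value of $\sigma$ amounts to a change of scale on both coordinate axes. The total area included between the curve and the $x$-axis is, of course, always equal to 1. Curves corresponding to some different values of $\sigma$ are shown in Fig. 16.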



Fig. 16. Normal frequency curves, $m = 0$, $\sigma$ = 0.4, 1.0, 2.6.

The smaller we take $\sigma$, the more we concentrate the mass of the distribution in the neighbourhood of $x = m$. In the limiting case $\sigma = 0$, the whole mass is concentrated in the point $x = m$, and consequently (cf 16.1) the d. f. is equal to $\varepsilon(x - m)$. This case will be regarded as a degenerate limiting case and called a singular normal distribution. The corresponding d. f. $\Phi\left(\frac{x - m}{0}\right)$ will always be interpreted as $\varepsilon(x - m)$.

It is often important to find the probability that a normally distributed variable differs from its mean $m$ in either direction by more than a given multiple $\lambda\sigma$ of the s. d. This probability is equal to the joint area of the two "tails" of the frequency curve that are cut off by ordinates through the points $x = m \pm \lambda\sigma$. Owing to the symmetry of the distribution, this is

    $P = P(|\xi - m| > \lambda\sigma) = 2(1 - \Phi(\lambda)) = \frac{2}{\sqrt{2\pi}}\int_{\lambda}^{\infty} e^{-\frac{x^2}{2}}\,dx.$

Conversely, we may regard $\lambda$ as a function of $P$, defined by this equation. Then $\lambda$ expresses, in units of the s. d. $\sigma$, that deviation from the mean value $m$, which is exceeded with the given probability $P$. When $P$ is expressed as a percentage, say $P = p/100$, the corresponding $\lambda = \lambda_p$ is called the $p$ percent value of the normal deviate $\frac{\xi - m}{\sigma}$. Some numerical values of $p$ as a function of $\lambda_p$, and of $\lambda_p$ as a function of $p$, are given in Table 2, p. 558. From the value of $\lambda_p$ for $p = 50$, it follows that the quartiles (cf 15.6) of the normal distribution are $m \pm 0.6745\,\sigma$. It is further seen that the 5 % value of $\frac{\xi - m}{\sigma}$ is about 2.0, the 1 % value about 2.6, and the 0.1 % value about 3.3. Deviations exceeding four times the standard deviation have extremely small probabilities.
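The $p$ percent values can be computed by solving the defining equation numerically. The Python sketch below uses the complementary error function of the standard library, since $2(1 - \Phi(\lambda)) = \operatorname{erfc}(\lambda/\sqrt{2})$, together with a plain bisection; the function names and the tolerance are assumptions.

```python
from math import erfc, sqrt

def two_sided_tail(lam):
    """P(|xi - m| > lam * sigma) = 2 (1 - Phi(lam)) for a normal variable."""
    return erfc(lam / sqrt(2.0))

def percent_value(p, tol=1e-12):
    """Solve two_sided_tail(lam) = p / 100 for lam by bisection (0 < p < 100)."""
    lo, hi = 0.0, 40.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if two_sided_tail(mid) > p / 100.0:
            lo = mid  # tail still too large: move to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    for p in (50, 5, 1, 0.1):
        print(p, round(percent_value(p), 4))
    print(two_sided_tail(4.0))  # probability of a deviation beyond four s. d.
```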

The standardized variable $\frac{\xi - m}{\sigma}$ has the d. f. $\Phi(x)$ and consequently by (17.1.3) the c. f. $e^{-t^2/2}$. It follows from (15.9.2) that the variable $\xi$ has the c. f.

(17.2.1)    $E(e^{it\xi}) = e^{mit - \frac{1}{2}\sigma^2t^2}.$

From this expression, the semi-invariants are found by (15.10.2), and we obtain

(17.2.2)    $\kappa_1 = m, \qquad \kappa_2 = \sigma^2, \qquad \kappa_3 = \kappa_4 = \cdots = 0.$

The moments about the mean of the variable $\xi$ are

(17.2.3)    $\mu_{2\nu+1} = 0, \qquad \mu_{2\nu} = 1\cdot 3\cdots(2\nu - 1)\,\sigma^{2\nu}.$

In particular, the coefficients of skewness and excess (cf 15.8) are

    $\gamma_1 = \frac{\mu_3}{\sigma^3} = 0, \qquad \gamma_2 = \frac{\mu_4}{\sigma^4} - 3 = 0.$

Finally we observe that, if the variable $\xi$ is normal $(m, \sigma)$, it follows from (15.1.1) that any linear function $a\xi + b$ is normal $(am + b, |a|\sigma)$.

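17.3. Addition of independent normal variables. - Let $\xi_1, \ldots, \xi_n$ be independent normally distributed variables, the parameters of $\xi_r$ being $m_r$ and $\sigma_r$. Consider the sum

    $\xi = \xi_1 + \xi_2 + \cdots + \xi_n.$

Denoting by $m$ and $\sigma$ the mean and the s. d. of $\xi$, we then have by (15.12.7)

(17.3.1)    $m = m_1 + m_2 + \cdots + m_n, \qquad \sigma^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_n^2.$

By the multiplication rule (15.12.1), the c. f. of $\xi$ is the product of the c. f:s of all the $\xi_r$. From the expression (17.2.1) for the c. f. of the normal distribution, we obtain

    $E(e^{it\xi}) = \prod_{r=1}^{n} e^{m_r it - \frac{1}{2}\sigma_r^2t^2} = e^{mit - \frac{1}{2}\sigma^2t^2}.$

This is, however, the c. f. of a normal distribution with the parameters $m$ and $\sigma$, and so we have proved the following important addition theorem for the normal distribution:

The sum of any number of independent normally distributed variables is itself normally distributed.

(17.3.2)    $\Phi\left(\frac{x - m_1}{\sigma_1}\right) * \Phi\left(\frac{x - m_2}{\sigma_2}\right) * \cdots * \Phi\left(\frac{x - m_n}{\sigma_n}\right) = \Phi\left(\frac{x - m}{\sigma}\right),$

where $m$ and $\sigma$ are given by (17.3.1).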



We mention without proof the following converse (Cramér, Ref. 11) of this theorem: If the sum $\xi = \xi_1 + \cdots + \xi_n$ of $n$ independent variables is normally distributed, then each component variable $\xi_r$ is itself normally distributed. Thus it is not only true that the normal distribution reproduces itself by composition, but, moreover, a normal distribution can never be exactly produced by the composition of non-normal components. On the other hand, we shall see in the following paragraph that, under very general conditions, the composition of a large number of non-normal components produces an approximately normal distribution.

Since any linear function of a normal variable is, by the preceding paragraph, itself normal, it follows from (17.3.2) that a linear function $a_1\xi_1 + \cdots + a_n\xi_n + b$ of independent normal variables is itself normal, with parameters $m$ and $\sigma$ given by $m = a_1m_1 + \cdots + a_nm_n + b$ and $\sigma^2 = a_1^2\sigma_1^2 + \cdots + a_n^2\sigma_n^2$. In particular, we have the important theorem that, if $\xi_1, \ldots, \xi_n$ are independent and all normal $(m, \sigma)$, the arithmetic mean $\bar{\xi} = \frac{1}{n}\sum_1^n \xi_r$ is itself normal $(m, \sigma/\sqrt{n})$.
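The parameters of a linear function of independent normal variables can be checked through the characteristic functions: the product of the component c. f:s must coincide with the c. f. of a normal distribution having the stated parameters. The Python sketch below illustrates this; the function names and the numerical values are assumptions.

```python
import cmath
from math import sqrt

def normal_cf(t, m, sigma):
    """C.f. exp(m i t - sigma^2 t^2 / 2) of a normal (m, sigma) variable, cf. (17.2.1)."""
    return cmath.exp(1j * m * t - 0.5 * sigma**2 * t**2)

def linear_combination_params(a, b, ms, sigmas):
    """Parameters (m, sigma) of a_1 xi_1 + ... + a_n xi_n + b for independent normal xi_r."""
    m = sum(ar * mr for ar, mr in zip(a, ms)) + b
    sigma = sqrt(sum((ar * sr) ** 2 for ar, sr in zip(a, sigmas)))
    return m, sigma

def cf_of_combination(t, a, b, ms, sigmas):
    """C.f. of the linear function: e^{itb} times the product of the c.f:s of the a_r xi_r."""
    out = cmath.exp(1j * b * t)
    for ar, mr, sr in zip(a, ms, sigmas):
        out *= normal_cf(ar * t, mr, sr)  # c.f. of a_r xi_r at t is the c.f. of xi_r at a_r t
    return out

if __name__ == "__main__":
    a, b = [2.0, -1.0, 0.5], 3.0                 # illustrative coefficients
    ms, sigmas = [1.0, 0.0, -2.0], [1.0, 2.0, 4.0]
    m, sigma = linear_combination_params(a, b, ms, sigmas)
    print(m, sigma)
    print(abs(cf_of_combination(0.3, a, b, ms, sigmas) - normal_cf(0.3, m, sigma)))
```

For the arithmetic mean of $n$ variables that are all normal $(m, \sigma)$, the same function with $a_r = 1/n$ and $b = 0$ returns $(m, \sigma/\sqrt{n})$.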

17.4. The Central Limit Theorem. - Consider a sum

(17.4.1)    $\xi = \xi_1 + \xi_2 + \cdots + \xi_n$

of $n$ independent variables, where $\xi_\nu$ has the mean $m_\nu$ and the s. d. $\sigma_\nu$. The mean $m$ and the s. d. $\sigma$ of the sum $\xi$ are then given by the usual expressions (17.3.1).

In the preceding paragraph we have seen that, if the $\xi_\nu$ are normally distributed, the sum $\xi$ is itself normal. On the other hand, De Moivre's theorem (cf 16.4) shows that, in the particular case when the $\xi_\nu$ are variables having the simple distribution (16.1.3), the distribution of the sum is approximately normal for large values of $n$. In fact, De Moivre's theorem asserts that in this particular case the d. f. of the standardized variable $\frac{\xi - m}{\sigma}$ tends to the normal function $\Phi(x)$ as $n$ tends to infinity.

It is a highly remarkable fact that the result thus established by De Moivre's theorem for a special case holds true under much more general circumstances.

It will be convenient to introduce the following terminology. Generally, if the distribution of a random variable $X$ depends on a parameter $n$, and if two quantities $m_0$ and $\sigma_0$ (which may or may not depend on $n$) can be found such that the d. f. of the variable $\frac{X - m_0}{\sigma_0}$ tends to $\Phi(x)$ as $n \to \infty$, we shall say that $X$ is asymptotically normal $(m_0, \sigma_0)$. This does not imply that the mean and the s. d. of $X$ tend to $m_0$ and $\sigma_0$, nor even that these moments exist, but is simply equivalent to saying that we have for any interval $(a, b)$ not depending on $n$

    $\lim_{n\to\infty} P(m_0 + a\sigma_0 < X < m_0 + b\sigma_0) = \Phi(b) - \Phi(a).$

Thus e. g. the variable $\nu$ considered in De Moivre's theorem is asymptotically normal $(np, \sqrt{npq})$.

The so called Central Limit Theorem in the mathematical theory of probability may now be expressed in the following way: Whatever be the distributions of the independent variables $\xi_\nu$, subject to certain very general conditions, the sum $\xi = \xi_1 + \cdots + \xi_n$ is asymptotically normal $(m, \sigma)$, where $m$ and $\sigma$ are given by (17.3.1).

This fundamental theorem was first stated by Laplace (Ref. 22) in 1812. A rigorous proof under fairly general conditions was given by Liapounoff (Ref. 146, 147) in 1901. The problem of finding the most general conditions of validity has been solved by Feller, Khintchine and Lévy (Ref. 85, 86, 140, 145). We shall here only prove the theorem in two particular cases that will be sufficient for most statistical applications.

Let us first consider the case of equal components, i. e. the case when all the $\xi_\nu$ in (17.4.1) have the same distribution. In this case we have $m = nm_1$, $\sigma = \sigma_1\sqrt{n}$, and the standardized variable may be written

    $\frac{\xi - m}{\sigma} = \frac{\xi - nm_1}{\sigma_1\sqrt{n}} = \frac{1}{\sigma_1\sqrt{n}}\sum_1^n (\xi_\nu - m_1),$

where all the deviations $\xi_\nu - m_1$ have the same distribution. Denote by $\varphi_1(t)$ the c. f. of any of these deviations, while $F(x)$ and $\varphi(t)$ are the d. f. and the c. f. of the standardized variable $\frac{\xi - m}{\sigma}$. It then follows from (15.9.2) and (15.12.1) that we have

(17.4.2)    $\varphi(t) = \left[\varphi_1\left(\frac{t}{\sigma_1\sqrt{n}}\right)\right]^n.$

The two first moments of the variable $\xi_\nu - m_1$ are 0 and $\sigma_1^2$, so that by (10.1.3) we have for the corresponding c. f. the expansion

    $\varphi_1(t) = 1 - \tfrac{1}{2}\sigma_1^2t^2 + o(t^2).$

Substituting $\frac{t}{\sigma_1\sqrt{n}}$ for $t$, we then obtain from (17.4.2)

    $\varphi(t) = \left[1 - \frac{t^2}{2n} + \frac{\delta(n, t)}{n}\right]^n,$

where for every fixed $t$ the quantity $\delta(n, t)$ tends to zero as $n \to \infty$. It follows that $\varphi(t) \to e^{-t^2/2}$ for every $t$, and hence we infer as in 16.4 that the corresponding d. f. $F(x)$ tends to $\Phi(x)$ for every $x$. We thus have the following case of the Central Limit Theorem, first proved by Lindeberg and Lévy (Ref. 24, 148):

If $\xi_1, \xi_2, \ldots$ are independent random variables all having the same probability distribution, and if $m_1$ and $\sigma_1$ denote the mean and the s. d. of every $\xi_\nu$, then the sum $\xi = \sum_1^n \xi_\nu$ is asymptotically normal $(nm_1, \sigma_1\sqrt{n})$. It follows that the arithmetic mean $\bar{\xi} = \frac{1}{n}\sum_1^n \xi_\nu$ is asymptotically normal $(m_1, \sigma_1/\sqrt{n})$.
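The Lindeberg-Lévy case can be illustrated with equal components that are far from normal, for instance the number shown by a fair die. The Python sketch below composes the distribution of the sum by repeated convolution and measures its distance from the normal d. f. after standardization; the function names, the choice of a die and the values of n are assumptions made for the illustration.

```python
from itertools import accumulate
from math import erf, sqrt

def Phi(x):
    """Normal distribution function, expressed through the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def sum_pmf(base, n):
    """Distribution of xi_1 + ... + xi_n, where base[k] = P(xi = k) for k = 0, 1, ..."""
    out = [1.0]
    for _ in range(n):
        nxt = [0.0] * (len(out) + len(base) - 1)
        for i, x in enumerate(out):
            for k, y in enumerate(base):
                nxt[i + k] += x * y
        out = nxt
    return out

def sup_distance(base, n):
    """sup over x of |F(x) - Phi(x)| for the standardized sum of n equal components."""
    m1 = sum(k * y for k, y in enumerate(base))
    s1 = sqrt(sum((k - m1) ** 2 * y for k, y in enumerate(base)))
    pmf = sum_pmf(base, n)
    cum = [0.0] + list(accumulate(pmf))  # cum[k] = P(sum < k), cum[k + 1] = P(sum <= k)
    worst = 0.0
    for k in range(len(pmf)):
        target = Phi((k - n * m1) / (s1 * sqrt(n)))
        worst = max(worst, abs(cum[k] - target), abs(cum[k + 1] - target))
    return worst

die = [0.0] + [1.0 / 6.0] * 6  # faces 1, ..., 6 with equal probabilities

if __name__ == "__main__":
    for n in (1, 5, 50):  # illustrative values
        print(n, round(sup_distance(die, n), 4))
```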


In the case of equal components, it is thus sufficient for the validity of the Central Limit Theorem to assume that the common distribution of the $\xi_\nu$ has a finite moment of the second order. When we proceed to the general case of variables $\xi_\nu$ that are not supposed to be equally distributed it is, however, no longer sufficient to assume that each $\xi_\nu$ has a finite second order moment, and thus we have to impose some further conditions. The object of such additional conditions is, generally speaking, to reduce the probability that an individual $\xi_\nu$ will yield a relatively large contribution to the total value of the sum $\xi$. An interesting sufficient condition of this type has been found by Lindeberg. We shall, however, here only give the following somewhat less general theorem due to Liapounoff:

Let $\xi_1, \xi_2, \ldots$ be independent random variables, and denote by $m_\nu$ and $\sigma_\nu$ the mean and the s. d. of $\xi_\nu$. Suppose that the third absolute moment of $\xi_\nu$ about its mean

    $\rho_\nu^3 = E(|\xi_\nu - m_\nu|^3)$

is finite for every $\nu$, and write

    $\rho^3 = \rho_1^3 + \rho_2^3 + \cdots + \rho_n^3.$

If the condition

(17.4.3)    $\lim_{n\to\infty} \frac{\rho}{\sigma} = 0$

is satisfied, then the sum $\xi = \sum_1^n \xi_\nu$ is asymptotically normal $(m, \sigma)$, where $m$ and $\sigma$ are given by (17.3.1).

In the particular case when all the $\xi_\nu$ are equally distributed, we have $\rho^3 = n\rho_1^3$, $\sigma^2 = n\sigma_1^2$, and thus $\frac{\rho}{\sigma} = \frac{\rho_1}{\sigma_1\sqrt[6]{n}}$, so that the condition is satisfied. It should not be inferred, however, that the Lindeberg-Lévy theorem proved above is a particular case of the Liapounoff theorem, since the former does not assume the existence of the third moment.
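The ratio $\rho/\sigma$ appearing in the Liapounoff condition is simple to evaluate once the component moments are known. The Python sketch below does this for equal components taking the values 0 and 1, for which the variance is $pq$ and the third absolute moment about the mean is $pq(p^2 + q^2)$; the function names and the value p = 0.3 are illustrative assumptions.

```python
def liapounoff_ratio(sigma2s, rho3s):
    """rho / sigma, with sigma^2 the sum of the sigma_nu^2 and rho^3 the sum of the rho_nu^3."""
    sigma = sum(sigma2s) ** 0.5
    rho = sum(rho3s) ** (1.0 / 3.0)
    return rho / sigma

def zero_one_moments(p):
    """Variance pq and third absolute central moment pq(p^2 + q^2) of a 0-1 variable."""
    q = 1.0 - p
    return p * q, p * q * (p * p + q * q)

if __name__ == "__main__":
    s2, r3 = zero_one_moments(0.3)  # illustrative value of p
    for n in (10, 1000, 100000):
        ratio = liapounoff_ratio([s2] * n, [r3] * n)
        # for equal components the ratio is rho_1 / (sigma_1 n^(1/6))
        print(n, ratio, r3 ** (1.0 / 3.0) / (s2**0.5 * n ** (1.0 / 6.0)))
```

The decay is slow, of the order $n^{-1/6}$, but this is all the condition requires.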

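In order to prove the Liapounoff theorem, we denote by $\varphi_\nu(t)$ the
c. f. of the $\nu$:th deviation $\xi_\nu - m_\nu$, and by $\varphi(t)$ the c. f. of the standardized
sum $\frac{\xi - m}{\sigma} = \frac{1}{\sigma} \sum_{\nu=1}^{n} (\xi_\nu - m_\nu)$. From (15.9.2) and (15.12.1) it then
follows that we have

(17.4.4)    $\varphi(t) = \prod_{\nu=1}^{n} \varphi_\nu\left(\frac{t}{\sigma}\right)$.

As before, it is sufficient to prove that for every fixed $t$ we have
$\varphi(t) \to e^{-t^2/2}$ when $n \to \infty$, as the theorem then directly follows from
the continuity theorem 10.4. Using the expansion (16.4.4) with
$k = 3$, we obtain

$\varphi_\nu(t) = 1 - \frac{\sigma_\nu^2 t^2}{2} + \vartheta \frac{\rho_\nu^3 |t|^3}{6}$,

where, as in 16.4, we use $\vartheta$ as a general notation for a quantity of
modulus not exceeding unity. We further obtain

$\log \varphi_\nu\left(\frac{t}{\sigma}\right) = \log\left(1 - \frac{\sigma_\nu^2 t^2}{2 \sigma^2} + \vartheta \frac{\rho_\nu^3 |t|^3}{6 \sigma^3}\right) = \log(1 + z)$,

where

$z = -\frac{\sigma_\nu^2 t^2}{2 \sigma^2} + \vartheta \frac{\rho_\nu^3 |t|^3}{6 \sigma^3}$.

Owing to the condition (17.4.3) we have, however, for all sufficiently
large values of $n$

$\frac{\rho_\nu}{\sigma} \le \frac{\rho}{\sigma} < 1$,

and thus, observing that by (15.4.6) we have $\sigma_\nu \le \rho_\nu$ for every $\nu$,

$z = \vartheta \frac{\rho_\nu^2}{\sigma^2} t^2 + \vartheta \frac{\rho_\nu^3}{\sigma^3} |t|^3 = \vartheta \frac{\rho_\nu^2}{\sigma^2} (t^2 + |t|^3)$.

Since $\rho_\nu \le \rho$, the condition (17.4.3) now shows that for every fixed $t$ we have
$z \to 0$ as $n \to \infty$. Thus certainly $|z| < \frac{1}{2}$ for all sufficiently large $n$. For
$|z| < \frac{1}{2}$ we have, however,

$\log(1 + z) = z - \frac{z^2}{2}\left(1 - \frac{2}{3} z + \frac{2}{4} z^2 - \cdots\right) = z + \vartheta z^2$,

and hence

$\log \varphi_\nu\left(\frac{t}{\sigma}\right) = -\frac{\sigma_\nu^2 t^2}{2 \sigma^2} + \vartheta \frac{\rho_\nu^3 |t|^3}{6 \sigma^3} + \vartheta \frac{\rho_\nu^4}{\sigma^4} (t^2 + |t|^3)^2$

$= -\frac{\sigma_\nu^2 t^2}{2 \sigma^2} + \vartheta \frac{\rho_\nu^3}{\sigma^3} \left(|t|^3 + (t^2 + |t|^3)^2\right)$.

Summing over $\nu = 1, 2, \ldots, n$, we now obtain by (17.4.4)

$\log \varphi(t) = -\frac{t^2}{2} + \vartheta \frac{\rho^3}{\sigma^3} \left(|t|^3 + (t^2 + |t|^3)^2\right)$.

As $n$ tends to infinity, it now follows from the condition (17.4.3) that
$\log \varphi(t)$ tends to $-\frac{t^2}{2}$ for every fixed $t$, and thus the Liapounoff theorem
is proved.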

In the case (cf. 16.6) of the variable $\nu = \sum_{r=1}^{n} \xi_r$ which expresses the number of
successes in a series of $n$ independent trials with the probabilities $p_1, \ldots, p_n$, we have

$\rho_r^3 = E(|\xi_r - p_r|^3) = p_r q_r (p_r^2 + q_r^2) \le p_r q_r$,

$\sigma^2 = \sum_{r=1}^{n} p_r q_r$,    $\rho^3 \le \sum_{r=1}^{n} p_r q_r$,

and thus

$\frac{\rho}{\sigma} \le \left(\sum_{r=1}^{n} p_r q_r\right)^{-1/6}$.

If the series $\sum_{1}^{\infty} p_r q_r$ is divergent, the Liapounoff condition (17.4.3) is satisfied, and
thus the variable $\nu$ is asymptotically normal $\left(\sum_{1}^{n} p_r, \sqrt{\sum_{1}^{n} p_r q_r}\right)$.
A sufficient condition for the divergence of $\sum p_r q_r$ is, e. g., that a number $c > 0$ can
be found such that $c < p_r < 1 - c$ for all $r$. If, on the other hand, $\sum p_r q_r$ is convergent,
it can be proved (Ref. 11) that the variable $\nu$ is not asymptotically normal.
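The inequality for $\rho / \sigma$ in this example can be checked directly. The sketch below is an illustration only: the function name `liapounoff_ratio` and the particular sequence of probabilities (kept between 0.2 and 0.8, so that $c = 0.1$ will do) are assumptions introduced here.

```python
import math

def liapounoff_ratio(probs):
    """Return (rho / sigma, bound) for independent trials with success probabilities probs."""
    # sigma^2 = sum p q, and rho^3 = sum p q (p^2 + q^2), as in the example above.
    variance = sum(p * (1 - p) for p in probs)
    rho_cubed = sum(p * (1 - p) * (p * p + (1 - p) ** 2) for p in probs)
    ratio = rho_cubed ** (1 / 3) / math.sqrt(variance)
    bound = variance ** (-1 / 6)
    return ratio, bound

for n in (10, 100, 1000, 10000):
    # Hypothetical probabilities spread over the interval [0.2, 0.8).
    probs = [0.2 + 0.6 * ((r * 0.618034) % 1.0) for r in range(1, n + 1)]
    ratio, bound = liapounoff_ratio(probs)
    print(n, round(ratio, 4), round(bound, 4))
```

Both columns decrease as $n$ grows, the first never exceeding the second, which is what the condition (17.4.3) requires in this case.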

17.5. Complementary remarks to the Central Limit Theorem.
The Central Limit Theorem has been modified and extended in various
directions. In this paragraph, we shall give a few brief remarks on
some of these questions, while the following paragraphs will be devoted
to a particular problem belonging to the same order of ideas.

1. The theorems of the preceding paragraph are exclusively concerned
with the distribution functions of the variables. It is the d. f. of
the standardized sum $\frac{\xi - m}{\sigma}$ that is shown to tend to the normal d. f.
$\Phi(x)$. If the component variables $\xi_\nu$ all belong to the continuous type,
the question arises if the frequency function of $\frac{\xi - m}{\sigma}$ tends to the normal
fr. f. $\Phi'(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$. It can, in fact, be shown (Cramér, Ref.
11, 70) that this is true if certain general regularity conditions are
imposed on the components (cf 17.7.4).

2. In problems of theoretical statistics it often occurs that we
are concerned with a function $g(\xi_1, \ldots, \xi_n)$ of $n$ independent random
variables, where $n$ may be considered as a large number. If the function
$g$ has continuous derivatives of the first and second orders in the
neighbourhood of the point $m = (m_1, \ldots, m_n)$, where $m_\nu$ denotes the
mean of $\xi_\nu$, we may write a Taylor expansion

(17.5.1)    $g(\xi_1, \ldots, \xi_n) = g(m_1, \ldots, m_n) + \sum_{\nu=1}^{n} c_\nu (\xi_\nu - m_\nu) + R$,

where $c_\nu$ is the value of $\frac{\partial g}{\partial \xi_\nu}$ in the point $m$, while the remainder $R$
contains derivatives of the second order. The first term on the right
hand side is a constant, while the second term is the sum of $n$ independent
random variables, each having the mean zero. By the
central limit theorem we can then say that, under general conditions,
the sum of the two first terms is asymptotically normal, with a mean
equal to the first term. In many important cases it is possible to
show that, in the limit as $n \to \infty$, the presence of the term $R$ has no
influence on the distribution, so that the function $g$ is, for large
values of $n$, approximately normally distributed (Cf von Mises, Ref.
157, 158). We shall return to this question in Ch. 28.
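The role of the linear term in (17.5.1) may be seen in a small experiment. The sketch below is only an illustration under assumptions made here: the function $g$ is taken as the exponential of the arithmetic mean, the components are uniform on (0, 1), and the names `linear_term_sd` and `simulated_sd` are hypothetical. It compares the s. d. of the linear term with the s. d. of $g$ found by simulation.

```python
import math
import random
from statistics import pstdev

def g(values):
    """Illustrative smooth function: exponential of the arithmetic mean."""
    return math.exp(sum(values) / len(values))

def linear_term_sd(n):
    """S.d. of the sum of c_v (xi_v - m_v) for uniform(0, 1) components."""
    c = math.exp(0.5) / n                   # partial derivative of g at m = (0.5, ..., 0.5)
    component_sd = math.sqrt(1.0 / 12.0)    # s.d. of one uniform(0, 1) variable
    return c * component_sd * math.sqrt(n)

def simulated_sd(n, trials, seed=1):
    """S.d. of g over repeated samples of n components."""
    rng = random.Random(seed)
    values = [g([rng.random() for _ in range(n)]) for _ in range(trials)]
    return pstdev(values)

n = 50
print(linear_term_sd(n), simulated_sd(n, trials=10000))
```

For this choice of $g$ the two values should agree closely, which suggests that the remainder $R$ contributes little when $n = 50$; for functions with larger second derivatives the agreement may be expected to be slower.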


3. The central limit theorem may be extended to various cases
when the variables in the sum are not independent. We shall here only
indicate one of these extensions (Cramér, Ref. 10, p. 145), which has a
considerable importance for various applications, especially to biological
problems. For further information, the reader may be referred to a
book by Lévy (Ref. 25), and to papers by Bernstein, Kapteyn and
Wicksell (Ref. 63, 135, 230). It will be convenient to use here a terminology
directly connected with some of the biological applications. If our
random variable is the size of some specified organ that we are observing,
the actual size of this organ in a particular individual may
often be regarded as the joint effect of a large number of mutually
independent causes, acting in an ordered sequence during the time of
growth of the individual. If these causes simply add their effects,
which are assumed to be random variables, we infer by the central
limit theorem that the sum is asymptotically normally distributed.

In general it does not, however, seem plausible that the causes 
co-operate by simple addition. It seems more natural to suppose that 
each cause gives an impulse, the effect of which depends both on the 
strength of the impulse and on the size of the organ already attained
at the instant when the impulse is working. 

Suppose that we have $n$ impulses $\xi_1, \ldots, \xi_n$, acting in the order
of their indices. These we consider as independent random variables.
Denote by $x_\nu$ the size of the organ which is produced by the impulses
$\xi_1, \ldots, \xi_\nu$. We may then suppose e. g. that the increase caused by
the impulse $\xi_{\nu+1}$ is proportional to $\xi_{\nu+1}$ and to some function $g(x_\nu)$ of
the momentary size of the organ:

(17.5.2)    $x_{\nu+1} = x_\nu + \xi_{\nu+1} g(x_\nu)$.

It follows that we have

$\xi_1 + \xi_2 + \cdots + \xi_n = \sum_{\nu=0}^{n-1} \frac{x_{\nu+1} - x_\nu}{g(x_\nu)}$.

If each impulse only gives a slight contribution to the growth of the
organ, we thus have approximately

$\xi_1 + \xi_2 + \cdots + \xi_n = \int_{x_0}^{x} \frac{dt}{g(t)}$,

where $x = x_n$ denotes the final size of the organ. By hypothesis
$\xi_1, \ldots, \xi_n$ are independent variables, and $n$ may be considered as a
large number. Under the general regularity conditions of the central
limit theorem it thus follows that, in the limit, the function of the
random variable $x$ appearing in the second member is normally distributed.

Consider, e. g., the case $g(t) = t$. The effect of each impulse is
then directly proportional to the momentary size of the organ. In
this case we thus find that $\log x$ is normally distributed. If, more
generally, $\log(x - a)$ is normal $(m, \sigma)$, it is easily seen that the
variable $x$ itself has the fr. f.

(17.5.3)    $\frac{1}{\sigma (x - a) \sqrt{2\pi}} e^{-\frac{(\log(x - a) - m)^2}{2 \sigma^2}}$

for $x > a$, while for $x \le a$ the fr. f. is zero. The corresponding frequency
curve, which is unimodal and of positive skewness, is illustrated
in Fig. 17. This logarithmico-normal distribution may be used
as the basic function of expansions in series, analogous to those derived
from the normal distribution, which are discussed in the following
paragraphs.
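The growth mechanism (17.5.2) with $g(t) = t$ can be simulated in a few lines. The sketch below rests on assumptions made only for this illustration: impulses uniform on (0, 0.1), an initial size 1, a fixed seed, and the helper names `final_sizes` and `skewness`. It compares the skewness of the final size $x$ with that of $\log x$.

```python
import math
import random

def skewness(data):
    """Third central moment divided by the cube of the s.d."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

def final_sizes(n_impulses, organisms, seed=2):
    """Apply (17.5.2) with g(x) = x to each simulated organism."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(organisms):
        x = 1.0                                  # assumed initial size
        for _ in range(n_impulses):
            x += rng.uniform(0.0, 0.1) * x       # increase proportional to impulse and size
        sizes.append(x)
    return sizes

sizes = final_sizes(n_impulses=200, organisms=5000)
print("skewness of x:    ", round(skewness(sizes), 3))
print("skewness of log x:", round(skewness([math.log(x) for x in sizes]), 3))
```

Under these assumptions the final size shows a marked positive skewness, while its logarithm is nearly symmetric, as the argument above leads us to expect.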

Similar arguments may be applied also in other cases, e. g. in certain branches
of economic statistics. Consider the distribution of incomes or property values in a
certain population. The position of an individual on the property scale might be regarded
as the effect of a large number of impulses, each of which causes a certain
increase of his wealth. It might be argued that the effect of such an impulse would
not unreasonably be expected to be proportional to the wealth already attained. If this
argument is accepted, we should expect distributions of incomes or property values to be
approximately logarithmico-normal. For low values of the income, the logarithmico-normal
curve seems, in fact, to agree fairly well with actual income curves (Quensel,
Ref. 201, 202). For moderate and large incomes, however, the Pareto distribution
discussed in 19.3 generally seems to give a better fit.

Fig. 17. The logarithmico-normal distribution, frequency curve for $a = 0$, $m = 0.46$, $\sigma = 1$.


17.6. Orthogonal expansion derived from the normal distribution.
Consider a random variable $\xi$ which is the sum

(17.6.1)    $\xi = \xi_1 + \xi_2 + \cdots + \xi_n$

of $n$ independent random variables. Under the conditions of the
central limit theorem, the d. f. $F(x)$ of the standardized variable $\frac{\xi - m}{\sigma}$
is for large $n$ approximately equal to $\Phi(x)$. Further, if all the components
$\xi_\nu$ have distributions of the continuous type, the fr. f. $f(x) = F'(x)$
will (cf 17.5) under certain general regularity conditions be
approximately equal to the normal fr. f. 1) $\varphi(x) = \Phi'(x)$. Writing

(17.6.2)    $F(x) = \Phi(x) + R(x)$,
            $f(x) = \varphi(x) + r(x)$,

this implies that $R(x)$ and $r(x) = R'(x)$ are small for large values of
$n$, so that $\Phi(x)$ and $\varphi(x)$ may be regarded as first approximations to
$F(x)$ and $f(x)$ respectively. It is then natural to ask if, by further
analysis of the remainder terms $R(x)$ and $r(x)$, we can find more
accurate approximations, e. g. in the form of some expansion of $R(x)$
and $r(x)$ in series.

1) As a rule we use the letter $\varphi$ to denote a characteristic function. In the
paragraphs 17.6 and 17.7, however, $\varphi(x)$ will denote the normal frequency function
$\varphi(x) = \Phi'(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, while the letter $\psi$ will be used for c. f:s.

The same problem may also be considered from a more general
point of view. In the applications, we often encounter fr. f:s and
d. f:s which are approximately normal, even in cases where there is
no reason to assume that the corresponding random variable is generated
in the form (17.6.1), as a sum of independent variables. It is
then natural to write these functions in the form (17.6.2), and to try
to find some convenient expansion for the remainder terms.

We shall here discuss two different types of such expansions. In
the present paragraph, we shall be concerned with the expansion in
orthogonal polynomials known as the Gram-Charlier series of type A
(Ref. 9, 65, 118), while the following paragraph will be devoted to the
asymptotic expansion introduced by Edgeworth. In both cases we shall
have to content ourselves with some formal developments and some brief
indications of the main results obtained, as the complete proofs are
rather complicated.

Let us first consider any random variable $\xi$ with a distribution of
the continuous type, without assuming that there is a representation
of the form (17.6.1). As usual we denote the mean and the s. d. of $\xi$
by $m$ and $\sigma$, while $\mu_\nu$ denotes the $\nu$:th order central moment (cf 15.4)
of $\xi$, which is supposed to be finite for all $\nu$. We shall consider
the standardized variable $\frac{\xi - m}{\sigma}$, and denote its d. f. and fr. f. by $F(x)$
and $f(x) = F'(x)$.

For any fr. f. $f(x)$, we may consider an expansion of the form

(17.6.3)    $f(x) = c_0 \varphi(x) + \frac{c_1}{1!} \varphi'(x) + \frac{c_2}{2!} \varphi''(x) + \cdots$,

where the $c_\nu$ are constant coefficients. According to (12.6.4), we have
$\varphi^{(\nu)}(x) = (-1)^\nu H_\nu(x) \varphi(x)$, where $H_\nu(x)$ is the Hermite polynomial of
degree $\nu$, and thus (17.6.3) is in reality an expansion in orthogonal
polynomials of the type (12.6.2). We shall now determine the coefficients
in the same way as in 12.6, assuming that the series may be
integrated term by term. Multiplying with $H_\nu(x)$ and integrating, we
directly obtain from the orthogonality relations (12.6.6)

(17.6.4)    $c_\nu = (-1)^\nu \int_{-\infty}^{\infty} H_\nu(x) f(x) \, dx$.

Now $f(x)$ is the fr. f. of the standardized variable $\frac{\xi - m}{\sigma}$, which has
zero mean and unit s. d., while its $\nu$:th moment is $\frac{\mu_\nu}{\sigma^\nu}$. Accordingly
we find $c_0 = 1$, $c_1 = c_2 = 0$, so that the development (17.6.3), and the
development obtained by formal integration, may be written

(17.6.5)    $F(x) = \Phi(x) + \frac{c_3}{3!} \Phi^{(3)}(x) + \frac{c_4}{4!} \Phi^{(4)}(x) + \cdots$,
            $f(x) = \varphi(x) + \frac{c_3}{3!} \varphi^{(3)}(x) + \frac{c_4}{4!} \varphi^{(4)}(x) + \cdots$,

where the $c_\nu$ are given by (17.6.4). From the expressions (12.6.5) of
the first Hermite polynomials, we obtain in particular, denoting by $\gamma_1$
and $\gamma_2$ the coefficients of skewness and excess (cf 15.8) of the variable $\xi$,

(17.6.6)    $c_3 = -\frac{\mu_3}{\sigma^3} = -\gamma_1$,
            $c_4 = \frac{\mu_4}{\sigma^4} - 3 = \gamma_2$,
            $c_5 = -\frac{\mu_5}{\sigma^5} + 10 \frac{\mu_3}{\sigma^3}$,
            $c_6 = \frac{\mu_6}{\sigma^6} - 15 \frac{\mu_4}{\sigma^4} + 30$.
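The coefficients (17.6.6) follow mechanically from (17.6.4) once the Hermite polynomials are known. The following sketch, which is an illustration added here and not part of the original development, builds the polynomials from the recurrence $H_{k+1}(x) = x H_k(x) - k H_{k-1}(x)$ and evaluates $c_\nu$ from the standardized moments. The exponential distribution with unit mean (central moments 1, 2, 9, 44, 265 for the orders 2 to 6) is used as an assumed example, and the function names are hypothetical.

```python
def hermite_coeffs(v):
    """Coefficients (lowest power first) of H_v, from H_{k+1} = x H_k - k H_{k-1}."""
    prev, cur = [1], [0, 1]
    if v == 0:
        return prev
    for k in range(1, v):
        nxt = [0] + cur                    # multiply H_k by x
        for i, a in enumerate(prev):
            nxt[i] -= k * a                # subtract k * H_{k-1}
        prev, cur = cur, nxt
    return cur

def gram_charlier_c(v, std_moments):
    """c_v = (-1)^v E[H_v(X)], where std_moments[k] = E[X^k] for the standardized X."""
    coeffs = hermite_coeffs(v)
    return (-1) ** v * sum(a * std_moments[k] for k, a in enumerate(coeffs))

# Standardized moments of order 0..6 for the assumed exponential example.
moments = [1, 0, 1, 2, 9, 44, 265]
c = {v: gram_charlier_c(v, moments) for v in range(7)}
print(c)
```

In this example $c_3 = -\gamma_1 = -2$ and $c_4 = \gamma_2 = 6$, in agreement with the first two lines of (17.6.6).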

With any standardized variable $\frac{\xi - m}{\sigma}$ having finite moments of all
orders, we may thus formally associate the expansions (17.6.5), the
coefficients of which are given by (17.6.4). But do these expansions
really converge and represent $f(x)$ and $F(x)$?

It can in fact be shown (cf e. g. Cramér, Ref. 69, 70) that, whenever
the integral

(17.6.6a)    $\int_{-\infty}^{\infty} e^{x^2/4} \, dF(x)$

is convergent, the first series (17.6.5) will converge for every $x$ to the
sum $F(x)$. If, in addition, the fr. f. $f(x)$ is of bounded variation in
$(-\infty, \infty)$, the second series (17.6.5) will converge to $f(x)$ in every
continuity point of $f(x)$. On the other hand, it can be shown by
examples (cf Ex. 18, p. 258) that, if these conditions are not satisfied,
the expansions may be divergent. Thus it is in reality only for a
comparatively small class of distributions that we can assert the
validity of the expansions (17.6.5). In fact, the majority of the important
distributions treated in the two following chapters are not
included in this class.

However, in practical applications it is in most cases only of
little value to know the convergence properties of our expansions.
What we really want to know is whether a small number of terms,
usually not more than two or three, suffice to give a good approximation
to $f(x)$ and $F(x)$. If we know this to be the case, it does not concern
us much whether the infinite series is convergent or divergent.
And conversely, if we know that one of the series (17.6.5) is convergent,
this knowledge is of little practical value if it will be necessary
to calculate a large number of the coefficients $c_\nu$ in order to have the
sum of the series determined to a reasonable approximation.

It is particularly when we are dealing with a variable $\xi$ generated
in the form (17.6.1) that the question thus indicated becomes important.
As pointed out above, we know that under certain general
conditions $F(x)$ and $f(x)$ are approximately equal to $\Phi(x)$ and $\varphi(x)$
when $n$ is large. Will the approximation be improved if we include
the term involving the third derivative in (17.6.5)? And will the
consideration of further terms of the expansions yield a still better
approximation? It will be seen that we are here in reality concerned
with a question relating to the asymptotic properties of our expansions
for large values of $n$.

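In order to simplify the algebraical calculations, we shall consider
the case of equal components (cf 17.4), when all the components $\xi_1, \ldots, \xi_n$
in (17.6.1) have the same distribution, with the mean $m_1$ and the s. d.
$\sigma_1$, so that we have $m = n m_1$, $\sigma = \sigma_1 \sqrt{n}$. In this case, we now propose
to study the behaviour of the coefficients $c_\nu$ of the A-series for large
values of $n$.

Let $\psi(t)$ denote the c. f. of the standardized sum $\frac{\xi - m}{\sigma}$, while
$\psi_1(t)$ is the c. f. of the deviation $\xi_1 - m_1$. According to (17.4.2) we
have

$\psi(t) = \left[\psi_1\left(\frac{t}{\sigma_1 \sqrt{n}}\right)\right]^n$.

For $\nu = 1, 2, \ldots$, let $\kappa_\nu$ denote the semi-invariants of $\xi - m = \sum_{\nu=1}^{n} (\xi_\nu - m_1)$,
while $\kappa'_\nu$ are the semi-invariants of $\xi_1 - m_1$, and put

(17.6.7)    $\lambda_\nu = \frac{\kappa_\nu}{\sigma^\nu}$,    $\lambda'_\nu = \frac{\kappa'_\nu}{\sigma_1^\nu}$.

We then have by (15.12.8)

(17.6.8)    $\kappa_\nu = n \kappa'_\nu$,    $\lambda_\nu = \frac{\lambda'_\nu}{n^{\nu/2 - 1}}$.

By the definition of the c. f. $\psi(t)$ we have

$e^{t^2/2} \psi(t) = \int_{-\infty}^{\infty} e^{t^2/2 + itx} f(x) \, dx$,

and hence obtain according to (12.6.7) the expansion

(17.6.9)    $e^{t^2/2} \psi(t) = \sum_{\nu=0}^{\infty} \frac{c_\nu}{\nu!} (-it)^\nu$,

or

(17.6.10)    $\psi(t) = e^{-t^2/2} + \frac{c_3}{3!} (-it)^3 e^{-t^2/2} + \frac{c_4}{4!} (-it)^4 e^{-t^2/2} + \cdots$,

where $c_\nu$ is given by (17.6.4).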

It should be observed that we cannot in general assert that the 
power series in the second member is convergent, but only that it 
holds as an asymptotic expansion for small values of t in the same 
sense as (10.1.3). 

If we compare (17.6.10) with the expansion

(17.6.11)    $f(x) = \varphi(x) + \frac{c_3}{3!} \varphi^{(3)}(x) + \frac{c_4}{4!} \varphi^{(4)}(x) + \cdots$,

it will be seen that the terms of the two expansions correspond by
means of the following relation obtained from (10.5.5):

(17.6.12)    $\int_{-\infty}^{\infty} e^{itx} \varphi^{(\nu)}(x) \, dx = (-it)^\nu e^{-t^2/2}$,    $(\nu = 0, 1, 2, \ldots)$.
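The relation (17.6.12) is easily checked by numerical integration. The sketch below is an illustration only; it assumes the representation $\varphi^{(\nu)}(x) = (-1)^\nu H_\nu(x) \varphi(x)$ quoted from (12.6.4), a simple midpoint rule on the interval (-12, 12), and function names introduced here.

```python
import cmath
import math

def hermite(v, x):
    """Hermite polynomial H_v(x), from H_{k+1} = x H_k - k H_{k-1}."""
    prev, cur = 1.0, x
    if v == 0:
        return prev
    for k in range(1, v):
        prev, cur = cur, x * cur - k * prev
    return cur

def phi_derivative(v, x):
    """v:th derivative of the normal fr. f., written as (-1)^v H_v(x) phi(x)."""
    return (-1) ** v * hermite(v, x) * math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def transform(v, t, half_width=12.0, steps=24000):
    """Midpoint-rule value of the integral of exp(itx) phi^(v)(x) over (-12, 12)."""
    h = 2 * half_width / steps
    total = 0j
    for j in range(steps):
        x = -half_width + (j + 0.5) * h
        total += cmath.exp(1j * t * x) * phi_derivative(v, x)
    return total * h

v, t = 3, 0.7
first_member = transform(v, t)
second_member = (-1j * t) ** v * math.exp(-t * t / 2)
print(abs(first_member - second_member))
```

The difference between the two members should be at the level of rounding error, since the integrand is smooth and negligible outside the chosen interval.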

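As remarked in an analogous case in 15.10, we may use power
series of the type (17.6.9) in a purely formal way, without paying any
attention to questions of convergence, as long as we are only concerned
with the deduction of the algebraic relations between the various
parameters, such as the $c_\nu$ and the $\lambda'_\nu$. Thus we may write, in accordance
with 15.10 and using (17.6.7),

$\log \psi_1(t) = \sum_{\nu=1}^{\infty} \frac{\kappa'_\nu}{\nu!} (it)^\nu = \sum_{\nu=1}^{\infty} \frac{\lambda'_\nu}{\nu!} (i \sigma_1 t)^\nu$,

$\log \psi(t) = n \log \psi_1\left(\frac{t}{\sigma_1 \sqrt{n}}\right) = \sum_{\nu=1}^{\infty} \frac{\lambda'_\nu}{\nu! \, n^{\nu/2 - 1}} (it)^\nu$.

Now $\xi_1 - m_1$ has the mean zero and the s. d. $\sigma_1$. Thus $\kappa'_1 = 0$ and
$\kappa'_2 = \sigma_1^2$, so that $\lambda'_1 = 0$ and $\lambda'_2 = 1$. Hence we may write the last
relation

(17.6.13)    $e^{t^2/2} \psi(t) = \exp\left(\sum_{\nu=3}^{\infty} \frac{\lambda'_\nu}{\nu! \, n^{\nu/2 - 1}} (it)^\nu\right)$.

In order to obtain an explicit expression for $c_\nu$ in terms of the $\lambda'_\nu$, it
now only remains to develop this expression in powers of $t$, and identify
the resulting series with (17.6.9). In this way we obtain

(17.6.14)    $c_3 = -\frac{\lambda'_3}{n^{1/2}}$,    $c_4 = \frac{\lambda'_4}{n}$,    $c_5 = -\frac{\lambda'_5}{n^{3/2}}$,    $c_6 = \frac{\lambda'_6}{n^2} + \frac{10 \lambda_3'^2}{n}$,

and generally an expression which shows that $c_\nu$ is of the form

(17.6.15)    $c_\nu = \frac{a_{\nu 1}}{n^{\nu/2 - 1}} + \frac{a_{\nu 2}}{n^{\nu/2 - 2}} + \cdots + \frac{a_{\nu, [\nu/3]}}{n^{\nu/2 - [\nu/3]}}$,

where $[\nu/3]$ denotes the greatest integer $\le \nu/3$, while the $a_{\nu h}$ are polynomials
in the $\lambda'_\nu$, which are independent of $n$. Thus

$c_\nu = O\left(n^{-\nu/2 + [\nu/3]}\right)$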

as $n$ tends to infinity. The following table shows the order of magnitude
of $c_\nu$ for the first values of $\nu$.

Subscript $\nu$.        Order of $c_\nu$.
3                       $n^{-1/2}$
4, 6                    $n^{-1}$
5, 7, 9                 $n^{-3/2}$
8, 10, 12               $n^{-2}$
11, 13, 15              $n^{-5/2}$

Thus the order of magnitude of the terms of the A-series is not
steadily decreasing as $\nu$ increases. Suppose, e. g., that we want to
calculate a partial sum of the series (17.6.11), taking account of all
terms involving corrections to $\varphi(x)$ of order $n^{-1/2}$ or $n^{-1}$. It then
follows from the table that we must consider the terms up to $\nu = 6$
inclusive. In order to calculate the coefficients $c_\nu$ of these terms
according to (17.6.6) or (17.6.14), we shall require the moments $\mu_\nu$ or
the semi-invariants $\lambda'_\nu$ up to the sixth order. An inspection of (17.6.14)
shows, however, that the contributions of order $n^{-1/2}$ and $n^{-1}$ really
do not contain any semi-invariants of order higher than the fourth,
so that in reality it ought not to be necessary to go beyond this
order. If we want to proceed further and include terms containing
the factors $n^{-3/2}$, $n^{-2}$ etc., it is easily seen that we shall encounter
precisely similar inadequacies.

Thus the Gram-Charlier A-series cannot be considered as a satisfactory
solution of the expansion problem for $F(x)$ and $f(x)$. We
want, in fact, a series which gives a straightforward expansion in
powers of $n^{-1/2}$, and is such that the calculation of the terms up to
a certain order of magnitude does not require the knowledge of any
moments or semi-invariants that are not really necessary. These conditions
are satisfied by Edgeworth's series, which will be treated in the
following paragraph.

17.7. Asymptotic expansion derived from the normal distribution.
In the preceding paragraph, the expansion of the function

(17.7.1)    $e^{t^2/2} \psi(t) = \exp\left(\sum_{\nu=3}^{\infty} \frac{\lambda'_\nu}{\nu! \, n^{\nu/2 - 1}} (it)^\nu\right)$

in powers of $t$ furnished expressions of the coefficients $c_\nu$ in the A-series.
The same function (17.7.1) can, however, also be expanded in
a different way, viz. in powers of $n^{-1/2}$. Writing


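$e^{t^2/2} \psi(t) = \exp\left((it)^2 \sum_{\nu=1}^{\infty} \frac{\lambda'_{\nu+2}}{(\nu+2)!} \left(\frac{it}{\sqrt{n}}\right)^\nu\right)$,

we obtain after development

$e^{t^2/2} \psi(t) = 1 + \sum_{\nu=1}^{\infty} \frac{b_{\nu,\nu+2} (it)^{\nu+2} + b_{\nu,\nu+4} (it)^{\nu+4} + \cdots + b_{\nu,3\nu} (it)^{3\nu}}{n^{\nu/2}}$,

where $b_{\nu,\nu+2h}$ is a polynomial in $\lambda'_3, \ldots, \lambda'_{\nu-h+3}$ which is independent
of $n$. By the integral relation (17.6.12), this corresponds to the expansion
in powers of $n^{-1/2}$:

(17.7.2)    $f(x) = \varphi(x) + \sum_{\nu=1}^{\infty} (-1)^\nu \frac{b_{\nu,\nu+2} \varphi^{(\nu+2)}(x) + \cdots + b_{\nu,3\nu} \varphi^{(3\nu)}(x)}{n^{\nu/2}}$,

the first terms of which are, writing all terms of a certain order with
respect to $n$ on the same line,

$f(x) = \varphi(x)$
$\quad - \frac{1}{3!} \frac{\lambda'_3}{n^{1/2}} \varphi^{(3)}(x)$
$\quad + \frac{1}{4!} \frac{\lambda'_4}{n} \varphi^{(4)}(x) + \frac{10}{6!} \frac{\lambda_3'^2}{n} \varphi^{(6)}(x)$
$\quad - \frac{1}{5!} \frac{\lambda'_5}{n^{3/2}} \varphi^{(5)}(x) - \frac{35}{7!} \frac{\lambda'_3 \lambda'_4}{n^{3/2}} \varphi^{(7)}(x) - \frac{280}{9!} \frac{\lambda_3'^3}{n^{3/2}} \varphi^{(9)}(x)$
$\quad + \cdots$.

By (17.6.7) and (17.6.8) the coefficients may be expressed in terms of
the semi-invariants $\kappa_\nu$, which in their turn may be replaced by the
central moments $\mu_\nu$ by means of (15.10.5). In this way we obtain the
series introduced by Edgeworth (Ref. 80):

(17.7.3)    $f(x) = \varphi(x)$
$\quad - \frac{1}{3!} \frac{\mu_3}{\sigma^3} \varphi^{(3)}(x)$
$\quad + \frac{1}{4!} \left(\frac{\mu_4}{\sigma^4} - 3\right) \varphi^{(4)}(x) + \frac{10}{6!} \left(\frac{\mu_3}{\sigma^3}\right)^2 \varphi^{(6)}(x)$
$\quad - \frac{1}{5!} \left(\frac{\mu_5}{\sigma^5} - 10 \frac{\mu_3}{\sigma^3}\right) \varphi^{(5)}(x) - \frac{35}{7!} \frac{\mu_3}{\sigma^3} \left(\frac{\mu_4}{\sigma^4} - 3\right) \varphi^{(7)}(x) - \frac{280}{9!} \left(\frac{\mu_3}{\sigma^3}\right)^3 \varphi^{(9)}(x)$
$\quad + \cdots$,

where the terms on each line are of the same order of magnitude.
In order to obtain a corresponding expansion for the d. f. $F(x)$ we
have only to replace $\varphi(x)$ by $\Phi(x)$.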

The asymptotic properties of these series have been investigated
by Cramér (Ref. 11, 70) who has shown that, under fairly general conditions,
the series (17.7.2) really gives an asymptotic expansion of
$f(x)$ in powers of $n^{-1/2}$, with a remainder term of the same order as
the first term neglected. Analogous results hold true for $F(x)$. If
we consider only the first term of the series, it follows in particular
that we have in these cases

(17.7.4)    $|F(x) - \Phi(x)| < \frac{A}{\sqrt{n}}$,    $|f(x) - \varphi(x)| < \frac{B}{\sqrt{n}}$,

where $A$ and $B$ are constants. 1)

1) It has been shown by Esseen (Ref. 83) and Bergström (Ref. 62) that the
inequality for $|F - \Phi|$ holds under the sole condition that the third absolute moment of the components is finite.

The terms of order $n^{-\nu/2}$ in Edgeworth's series contain the moments
$\mu_3, \ldots, \mu_{\nu+2}$, which are precisely the moments necessarily required for
an approximation to this order. In practice it is usually not advisable
to go beyond the third and fourth moments. The terms containing
these moments will, however, often be found to give a good approximation
to the distribution. For the numerical calculations, tables of
the derivatives $\varphi^{(\nu)}(x)$ will be required. These are given in Table 1,
p. 557.

Introducing the coefficients $\gamma_1$ and $\gamma_2$ of skewness and excess (cf
15.8), we may write the expression for $f(x)$ up to terms of order $n^{-1}$

(17.7.5)    $f(x) = \varphi(x) - \frac{\gamma_1}{3!} \varphi^{(3)}(x) + \frac{\gamma_2}{4!} \varphi^{(4)}(x) + \frac{10 \gamma_1^2}{6!} \varphi^{(6)}(x)$.
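The practical value of (17.7.5) can be judged in a case where the exact fr. f. is known. The following sketch is an illustration under assumptions made here: the components are exponential with unit mean, so that the sum has a gamma distribution with $\gamma_1 = 2 / \sqrt{n}$ and $\gamma_2 = 6 / n$, and the derivatives are written as $\varphi^{(\nu)}(x) = (-1)^\nu H_\nu(x) \varphi(x)$. All function names are hypothetical.

```python
import math

def phi(x):
    """Normal frequency function."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def hermite(v, x):
    """Hermite polynomial H_v(x), from H_{k+1} = x H_k - k H_{k-1}."""
    prev, cur = 1.0, x
    if v == 0:
        return prev
    for k in range(1, v):
        prev, cur = cur, x * cur - k * prev
    return cur

def exact_fr_f(x, n):
    """Fr. f. of the standardized sum of n exponential variables with unit mean."""
    s = n + x * math.sqrt(n)                  # value of the unstandardized sum
    if s <= 0:
        return 0.0
    log_density = (n - 1) * math.log(s) - s - math.lgamma(n)
    return math.sqrt(n) * math.exp(log_density)

def edgeworth(x, gamma1, gamma2, order):
    """(17.7.5) truncated after the terms of order n^(-order/2), order = 0, 1 or 2."""
    # Since phi^(v) = (-1)^v H_v phi, the third-derivative term enters with a plus sign.
    bracket = 1.0
    if order >= 1:
        bracket += gamma1 / 6 * hermite(3, x)
    if order >= 2:
        bracket += gamma2 / 24 * hermite(4, x) + 10 * gamma1 ** 2 / 720 * hermite(6, x)
    return phi(x) * bracket

def max_error(n, order):
    """Largest absolute error on a grid of x values from -3 to 4."""
    gamma1, gamma2 = 2 / math.sqrt(n), 6 / n
    grid = [-3 + 0.05 * j for j in range(141)]
    return max(abs(exact_fr_f(x, n) - edgeworth(x, gamma1, gamma2, order)) for x in grid)

for order in (0, 1, 2):
    print(order, max_error(25, order))
```

In this example each additional line of the series reduces the largest error on the grid; this is only one favourable case and should not be read as a general guarantee.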




Fig. 18. Derivatives of the normal frequency function $\varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$.

Diagrams of the derivatives $\varphi^{(3)}$, $\varphi^{(4)}$ and $\varphi^{(6)}$, with the numerical
coefficients appearing in (17.7.5), are shown in Fig. 18. The curves
for $\varphi^{(4)}$ and $\varphi^{(6)}$ are symmetric about $x = 0$, while the third derivative
$\varphi^{(3)}$ introduces an asymmetric element into the expression.

For large $x$, the expression (17.7.5) will sometimes yield small negative
values for $f(x)$. This is, of course, quite consistent with the fact
that (17.7.5) gives an approximate, but not an exact, expression for
the frequency function.

For the mode $x_0$ of the fr. f., we obtain from (17.7.5) the approximate
expression $x_0 = -\frac{1}{2} \gamma_1$, which is Charlier's measure of skewness.
We further have

$\frac{f(0) - \varphi(0)}{\varphi(0)} = \frac{1}{8} \gamma_2 - \frac{5}{24} \gamma_1^2$.

The first member represents the relative excess of the frequency curve
$y = f(x)$ over the normal curve $y = \varphi(x)$ at the point $x = 0$. 1) For
this quantity, Charlier gave the expression $\frac{1}{8} \gamma_2$, which he introduced
as his measure of excess. However, it follows from the above that
the term in $\gamma_1^2$ must be included in order to have an expression of
the excess which is correct up to terms of the order $n^{-1}$ (cf 15.8).

1) If, instead of comparing the ordinates in the mean $x = 0$, we compare the ordinates
in the modes of the two curves, we obtain in the first approximation

$\frac{f(x_0) - \varphi(0)}{\varphi(0)} = \frac{1}{8} \gamma_2 - \frac{1}{12} \gamma_1^2$.

17.8. The rôle of the normal distribution in statistics. The
normal distribution was first found in 1733 by De Moivre (Ref. 29),
in connection with his discussion of the limiting form of the binomial
distribution treated in 16.4.

De Moivre's discovery seems, however, to have passed unnoticed,
and it was not until long afterwards that the normal distribution
was rediscovered by Gauss (Ref. 16, 1809) and Laplace (Ref. 22, 1812).
The latter did, in fact, touch the subject already in some papers
about 1780, though he did not go deeper into it before his great
work of 1812. Gauss and Laplace were both led to the normal
function in connection with their work on the theory of errors of
observation. Laplace gave, moreover, the first (incomplete) statement
of the general theorem studied above under the name of the Central
Limit Theorem, and made a great number of important applications
of the normal distribution to various questions in the theory of probability.

Under the influence of the great works of Gauss and Laplace, it
was for a long time more or less regarded as an axiom that statistical
distributions of practically all kinds would approach the normal distribution
as an ideal limiting form, if only we could dispose of a
sufficiently large number of sufficiently accurate observations. The
deviation of any random variable from its mean was regarded as an
»error», subject to the »law of errors» expressed by the normal
distribution.

Even if this view was definitely exaggerated and has had to be considerably
modified, it is undeniable that, in a large number of important
applications, we meet distributions which are at least approximately
normal. Such is the case, e. g., with the distributions of errors
of physical and astronomical measurements, a great number of demographical
and biological distributions, etc.

The central limit theorem affords a theoretical explanation of these
empirical facts. According to the »hypothesis of elementary errors»
introduced by Hagen and Bessel, the total error committed at a physical
or astronomical measurement is regarded as the sum of a large
number of mutually independent elementary errors. By the central
limit theorem, the total error should then be approximately normally
distributed. In a similar way, it often seems reasonable to regard
a random variable observed e. g. in some biological investigation as
being the total effect of a large number of independent causes, which
sum up their effects. The same point of view may be applied to the
variables occurring in many technical and economical questions. Thus
the total consumption of electric energy delivered by a certain producer
is the sum of the quantities consumed by the various customers,
the total gain or loss on the risk business of an insurance company
is the sum of the gains or losses on each single policy, etc.

In cases of this character, we should expect to find at least
approximately normal distributions. If the number of components is
not sufficiently large, or if the various components cannot be regarded
as strictly additive and independent, the modifications of the central
limit theorem indicated in 17.5 - 17.7 may still show that the distribution
is approximately normal, or they may indicate the use of some
distribution closely related to the normal, such as the asymptotic
expansion (17.7.3) or the logarithmico-normal distribution (17.5.3).

Under the conditions of the central limit theorem, the arithmetic
mean of a large number of independent variables is approximately
normally distributed. The remarks made in connection with (17.5.1)
imply that this property holds true even for certain functions of a
more general character than the mean. These properties are of a
fundamental importance for many methods used in statistical practice,
where we are largely concerned with means and other similar functions
of the observed values of random variables (cf Ch. 28).

There is a famous remark by Lippman (quoted by Poincaré, Ref.
31) to the effect that »everybody believes in the law of errors, the
experimenters because they think it is a mathematical theorem, the
mathematicians because they think it is an experimental fact». It
seems appropriate to comment that both parties are perfectly right,
provided that their belief is not too absolute: mathematical proof
tells us that, under certain qualifying conditions, we are justified in
expecting a normal distribution, while statistical experience shows
that, in fact, distributions are often approximately normal.


CHAPTER 18. 

Various Distributions Related to the Normal. 

In this chapter, we shall consider the distributions of some simple functions of
normally distributed variables. All these distributions have important statistical
applications, and will reappear in various connections in Part III.

18.1. The $\chi^2$-distribution. Let $\xi$ be a random variable which is
normal (0, 1). The fr. f. of the square $\xi^2$ is, by (15.1.4), equal to

$\frac{1}{\sqrt{2 \pi x}} e^{-x/2}$

for $x > 0$. For $x \le 0$, the fr. f. is zero. The c. f. corresponding to
this fr. f. is obtained by putting $\alpha = \lambda = \frac{1}{2}$ in (12.3.4), and is

$\int_{0}^{\infty} e^{itx} \frac{1}{\sqrt{2 \pi x}} e^{-x/2} \, dx = (1 - 2it)^{-1/2}$.

Let now $\xi_1, \ldots, \xi_n$ be $n$ independent random variables, each of
which is normal (0, 1), and consider the variable

(18.1.1)    $\chi^2 = \sum_{\nu=1}^{n} \xi_\nu^2$.

Each $\xi_\nu^2$ has the c. f. $(1 - 2it)^{-1/2}$, and thus by the multiplication theorem
(15.12.1) the sum $\chi^2$ has the c. f.

(18.1.2)    $E(e^{it\chi^2}) = (1 - 2it)^{-n/2}$.

This is, however, the c. f. obtained by putting $\alpha = \frac{1}{2}$, $\lambda = \frac{1}{2} n$ in (12.3.4),
and the corresponding distribution is thus defined by the fr. f.
$f(x; \frac{1}{2}, \frac{1}{2} n)$ as given by (12.3.3). We shall introduce a particular notation
for this fr. f., writing for any $n = 1, 2, \ldots$

(18.1.3)    $k_n(x) = \frac{1}{2^{n/2} \Gamma(n/2)} x^{n/2 - 1} e^{-x/2}$ for $x > 0$,
            $k_n(x) = 0$ for $x \le 0$.

Thus $k_n(x)$ is the fr. f. of the variable $\chi^2$, so that we have

$k_n(x) \, dx = P(x < \chi^2 < x + dx)$.

The corresponding d. f. is zero for $x \le 0$, while for $x > 0$ it is

(18.1.4)    $K_n(x) = P(\chi^2 \le x) = \frac{1}{2^{n/2} \Gamma(n/2)} \int_{0}^{x} t^{n/2 - 1} e^{-t/2} \, dt$.

The distribution defined by the fr. f. $k_n(x)$ or the d. f. $K_n(x)$ is known
as the $\chi^2$-distribution, a name referring to an important statistical application
of the distribution. This will be treated in Ch. 30. The
distribution contains a parameter $n$, which is often denoted as the
number of degrees of freedom in the distribution. The meaning of this
term will be explained in Ch. 29. The $\chi^2$-distribution was first found
by Helmert (Ref. 125) and K. Pearson (Ref. 183).

For $n \le 2$, the fr. f. $k_n(x)$ is steadily decreasing for $x > 0$, while
for $n > 2$ there is a unique maximum at the point $x = n - 2$. Diagrams
of the function are shown for some values of $n$ in Fig. 19.

The moments $\alpha_\nu$ and the semi-invariants $\kappa_\nu$ of the $\chi^2$-distribution
are finite for all $\nu$, and their general expressions may be obtained
e. g. from the c. f. (18.1.2), using the formulae in 10.1 and 15.10:

(18.1.5)    $\alpha_\nu = n (n + 2) \cdots (n + 2\nu - 2)$,
            $\kappa_\nu = 2^{\nu - 1} (\nu - 1)! \, n$.

Hence in particular

(18.1.6)    $E(\chi^2) = \alpha_1 = n$,    $D^2(\chi^2) = \alpha_2 - \alpha_1^2 = 2n$.
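The expression (18.1.5) for the moments may be compared with a direct numerical integration of $x^\nu k_n(x)$. The sketch below is an illustration only; the midpoint rule, the upper limit of integration and the function names are assumptions made here.

```python
import math

def chi2_fr_f(x, n):
    """k_n(x) as defined in (18.1.3)."""
    if x <= 0:
        return 0.0
    log_k = (n / 2 - 1) * math.log(x) - x / 2 - (n / 2) * math.log(2) - math.lgamma(n / 2)
    return math.exp(log_k)

def moment_by_integration(v, n, upper=100.0, steps=100000):
    """Midpoint-rule approximation to the v:th moment about the origin."""
    h = upper / steps
    return h * sum(((j + 0.5) * h) ** v * chi2_fr_f((j + 0.5) * h, n) for j in range(steps))

def moment_by_formula(v, n):
    """alpha_v = n (n + 2) ... (n + 2v - 2), as in (18.1.5)."""
    return math.prod(n + 2 * j for j in range(v))

n = 5
for v in (1, 2, 3):
    print(v, moment_by_formula(v, n), round(moment_by_integration(v, n), 4))
```

For $n = 5$ the formula gives 5, 35 and 315, so that the variance is $35 - 25 = 10 = 2n$, in agreement with (18.1.6).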


Let $\chi_1^2$ and $\chi_2^2$ be two independent variables distributed according
to (18.1.4) with the values $n_1$ and $n_2$ of the parameter. The expression
(18.1.2) of the c. f. of the $\chi^2$-distribution then shows that the
c. f. of the sum $\chi_1^2 + \chi_2^2$ is

$(1 - 2it)^{-n_1/2} (1 - 2it)^{-n_2/2} = (1 - 2it)^{-(n_1 + n_2)/2}$.

Thus the $\chi^2$-distribution, like the binomial, the Poisson and the normal,
reproduces itself by composition, and we have the addition theorem:

(18.1.7)    $K_{n_1}(x) * K_{n_2}(x) = K_{n_1 + n_2}(x)$.

This may, in fact, be regarded as an evident consequence of the definition
(18.1.1) of the variable $\chi^2$, since the sum $\chi_1^2 + \chi_2^2$ is the sum
of $n_1 + n_2$ independent squares.

Extensive tables of the $\chi^2$-distribution are available (Ref. 262, 264,
265). In many applications, it is important to find the probability $P$ that
the variable $\chi^2$ assumes a value exceeding a given quantity $\chi_0^2$. This probability
is equal to the area of the tail of the frequency curve situated
to the right of an ordinate through the point $x = \chi_0^2$. Thus

$P = P(\chi^2 > \chi_0^2) = \int_{\chi_0^2}^{\infty} k_n(x) \, dx = 1 - K_n(\chi_0^2)$.

Usually it is most convenient to tabulate $\chi_0^2$ as a function of the
probability $P$. When $P$ is expressed in percent, say $P = p/100$, the
corresponding $\chi_0^2 = \chi_p^2$ is called the $p$ percent value of $\chi^2$ for $n$ degrees
of freedom. Some numerical values of this function are given in
Table 3, p. 559.
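Where tables are not at hand, the $p$ percent value may be computed directly. The following sketch is one possible approach and not the method by which the printed tables were prepared: it assumes the well-known power series of the incomplete gamma function for $K_n(x)$ and a simple bisection, and the function names are introduced here for illustration.

```python
import math

def chi2_d_f(x, n):
    """K_n(x), computed from the power series of the incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, y = n / 2, x / 2
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= y / (a + k)              # next term y^k / (a (a + 1) ... (a + k))
        total += term
        k += 1
    return total * math.exp(a * math.log(y) - y - math.lgamma(a))

def tail_probability(x0, n):
    """P = P(chi^2 > x0) = 1 - K_n(x0)."""
    return 1.0 - chi2_d_f(x0, n)

def percent_value(p, n):
    """Solve P(chi^2 > x) = p / 100 for x by bisection."""
    lo, hi = 0.0, 1.0
    while tail_probability(hi, n) > p / 100:
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if tail_probability(mid, n) > p / 100:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(percent_value(5, 1), 3), round(percent_value(5, 10), 3))
```

For two degrees of freedom the result can be checked in closed form, since then $K_2(x) = 1 - e^{-x/2}$ and the $p$ percent value is $-2 \log(p/100)$.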

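We shall now give some simple transformations of the $\chi^2$-distribution
that are often required in the applications.

If each of the independent variables $x_1, \ldots, x_n$ is normal $(0, \sigma)$,
where $\sigma > 0$ is an arbitrary constant, the variables $x_1/\sigma, \ldots, x_n/\sigma$ are
independent and normal (0, 1). Thus according to the above the fr. f.
of the variable $\sum_1^n (x_\nu/\sigma)^2$ is equal to $k_n(x)$. Then by (15.1.2) the fr. f. of
the variable $\sum_1^n x_\nu^2$ is

(18.1.8)    $\frac{1}{\sigma^2}\, k_n\!\left(\frac{x}{\sigma^2}\right) = \frac{1}{2^{n/2} \sigma^n \Gamma(n/2)}\, x^{n/2 - 1} e^{-x/(2\sigma^2)}, \qquad (x > 0).$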

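By similar easy transformations, we find the fr. f:s of the arithmetic
mean $\frac{1}{n}\sum_1^n x_\nu^2$, the non-negative square root $\sqrt{\sum_1^n x_\nu^2}$, and the
square root of the arithmetic mean $\sqrt{\frac{1}{n}\sum_1^n x_\nu^2}$. The results are shown
in the following table. $x_1, \ldots, x_n$ are throughout supposed to be independent
and normal $(0, \sigma)$. For $x \le 0$, the fr. f:s are all equal to zero.

Variable.                         Frequency function (x > 0).

$\sum_1^n x_\nu^2$
    $\frac{1}{\sigma^2}\, k_n\!\left(\frac{x}{\sigma^2}\right) = \frac{1}{2^{n/2} \sigma^n \Gamma(n/2)}\, x^{n/2 - 1} e^{-x/(2\sigma^2)}$

$\frac{1}{n}\sum_1^n x_\nu^2$
    $\frac{n}{\sigma^2}\, k_n\!\left(\frac{nx}{\sigma^2}\right) = \frac{(n/2)^{n/2}}{\sigma^n \Gamma(n/2)}\, x^{n/2 - 1} e^{-nx/(2\sigma^2)}$

$\sqrt{\sum_1^n x_\nu^2}$
    $\frac{2x}{\sigma^2}\, k_n\!\left(\frac{x^2}{\sigma^2}\right) = \frac{2}{2^{n/2} \sigma^n \Gamma(n/2)}\, x^{n - 1} e^{-x^2/(2\sigma^2)}$

$\sqrt{\frac{1}{n}\sum_1^n x_\nu^2}$
    $\frac{2nx}{\sigma^2}\, k_n\!\left(\frac{nx^2}{\sigma^2}\right) = \frac{2 (n/2)^{n/2}}{\sigma^n \Gamma(n/2)}\, x^{n - 1} e^{-nx^2/(2\sigma^2)}$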

If the horizontal and vertical deviations u and v of a shot from the centre of
the target are independent and normal $(0, \sigma)$, the distance $r = \sqrt{u^2 + v^2}$ from the
centre will have the fr. f.

    $\frac{x}{\sigma^2}\, e^{-x^2/(2\sigma^2)}, \qquad (x > 0).$

If the components u, v and w of the velocity of a molecule with respect to
a system of rectangular axes are independent and normal $(0, \sigma)$, the velocity
$r = \sqrt{u^2 + v^2 + w^2}$ will have the fr. f.

    $\sqrt{\frac{2}{\pi}}\, \frac{x^2}{\sigma^3}\, e^{-x^2/(2\sigma^2)}, \qquad (x > 0).$




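18.2. Student's distribution. Suppose that the n + 1 random variables
$\xi$ and $\xi_1, \ldots, \xi_n$ are independent and normal $(0, \sigma)$. Let us write

    $\eta = \sqrt{\frac{1}{n} \sum_1^n \xi_\nu^2},$

where the square root is taken positively, and consider the variable

(18.2.1)    $t = \frac{\xi}{\eta} = \frac{\xi}{\sqrt{\frac{1}{n} \sum_1^n \xi_\nu^2}}.$

Let $S_n(x)$ denote the d. f. of the variable t, so that we have

    $S_n(x) = P(t \le x) = P\left(\frac{\xi}{\eta} \le x\right).$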

By hypothesis $\xi$ and $\eta$ are independent variables, and thus according
to (15.11.3) their joint fr. f. is the product of the fr. f:s of $\xi$ and $\eta$.
Now $\xi$ is normal $(0, \sigma)$, and $\eta$ has the fr. f. given in the last line of
the table in the preceding paragraph, so that the joint fr. f. is 1)

    $c_n\, \eta^{n-1} e^{-\frac{\xi^2 + n\eta^2}{2\sigma^2}},$

where $\eta > 0$ and

    $c_n = \frac{2\, (n/2)^{n/2}}{\sqrt{2\pi}\; \sigma^{n+1}\, \Gamma(n/2)}.$

The probability of the relation $\xi/\eta \le x$ is the integral of the joint
fr. f. over the domain defined by the inequalities $\eta > 0$, $\xi \le x\eta$:

    $S_n(x) = c_n \iint_{\eta > 0,\; \xi \le x\eta} \eta^{n-1} e^{-\frac{\xi^2 + n\eta^2}{2\sigma^2}}\, d\xi\, d\eta.$

1) As a rule we have hitherto used corresponding letters from different alphabets
to denote a random variable and the variable in its d. f. or fr. f., and have thus
employed expressions such as: »The random variable $\xi$ has the fr. f. $f(x)$». When
dealing with many variables simultaneously it is, however, sometimes practical to
depart from this rule and use the same letter in both places. We shall thus
occasionally use expressions such as »The random variable $\xi$ has the fr. f. $f(\xi)$» or
»The random variables $\xi$ and $\eta$ have the joint fr. f. $f(\xi, \eta)$».


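Introducing new variables u, v by the substitution

(18.2.2)    $\xi = uv, \qquad \eta = v,$

the Jacobian of which is $\frac{\partial(\xi, \eta)}{\partial(u, v)} = v$, we obtain

(18.2.3)    $S_n(x) = c_n \int_{-\infty}^{x} du \int_0^{\infty} v^n e^{-\frac{n + u^2}{2\sigma^2} v^2}\, dv = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\; \Gamma\left(\frac{n}{2}\right)} \int_{-\infty}^{x} \frac{du}{\left(1 + \frac{u^2}{n}\right)^{\frac{n+1}{2}}}.$

The corresponding fr. f. $s_n(x) = S_n'(x)$ exists for all values of x and
is given by the expression

(18.2.4)    $s_n(x) = \frac{1}{\sqrt{n\pi}}\, \frac{\Gamma\left(\frac{n+1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)} \left(1 + \frac{x^2}{n}\right)^{-\frac{n+1}{2}}.$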

The distribution defined by the fr. f. $s_n(x)$ or the d. f. $S_n(x)$ is
known under the name of Student's distribution or the t-distribution.
It was first used in an important statistical problem by W. S.
Gosset, writing under the pen-name of »Student» (Ref. 221). As in
the case of the $\chi^2$-distribution, the parameter n is often denoted as
the number of degrees of freedom in the distribution (cf. 29.2).

From the expression of the fr. f. $s_n(x)$, it is seen that the distribution
is independent of the s. d. $\sigma$ of the basic variables $\xi$ and $\xi_\nu$. This
was, of course, to be expected since the variable t is a homogeneous
function of degree zero in the basic variables. It is further seen
that the distribution is unimodal and symmetric about x = 0. The
$\nu$:th moment of the distribution is finite for $\nu < n$. In particular, the
mean is finite for n > 1, and the s. d. for n > 2. Owing to the symmetry
of the distribution, all existing moments of odd order are zero,
while a simple calculation gives

    $D^2(t) = \int_{-\infty}^{\infty} x^2 s_n(x)\, dx = \frac{n}{n - 2},$

and generally for $2\nu < n$

    $\alpha_{2\nu} = \mu_{2\nu} = \frac{1 \cdot 3 \cdots (2\nu - 1)\, n^\nu}{(n - 2)(n - 4) \cdots (n - 2\nu)}.$

The probability that the variable t differs from its mean zero in
either direction by more than a given quantity $t_0$ is, as in the case
of the normal distribution, equal to the joint area of the two tails
of the frequency curve cut off by ordinates through the points $\pm t_0$.
On account of the symmetry of the t-distribution, this is

(18.2.5)    $P = P(|t| > t_0) = 2 \int_{t_0}^{\infty} s_n(x)\, dx = 2(1 - S_n(t_0)).$

From this relation, the deviation $t_0$ may be tabulated as a function
of the probability P. When P = p/100, the corresponding $t_0 = t_p$ is
called the p percent value of t for n degrees of freedom. Some
numerical values of this function are given in Table 4, p. 560.
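
The relation (18.2.5) lends itself to direct computation. The following sketch is only an illustration, with hypothetical function names: it evaluates $S_n(x)$ by Simpson's rule applied to the fr. f. (18.2.4) and finds $t_p$ by bisection.

```python
import math

def student_density(x, n):
    """Fr. f. s_n(x) of Student's distribution, as in (18.2.4)."""
    c = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return c * (1 + x * x / n) ** (-(n + 1) / 2)

def student_cdf(x, n, steps=2000):
    """S_n(x) = 1/2 + integral of s_n from 0 to x, by Simpson's rule (steps even)."""
    if x == 0:
        return 0.5
    h = x / steps
    total = student_density(0.0, n) + student_density(x, n)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * student_density(i * h, n)
    return 0.5 + total * h / 3

def two_tail_probability(t0, n):
    """P(|t| > t0) = 2 (1 - S_n(t0)), as in (18.2.5)."""
    return 2 * (1 - student_cdf(t0, n))

def student_percent_value(p, n):
    """Deviation t_p with P(|t| > t_p) = p / 100, found by bisection."""
    target = p / 100
    lo, hi = 0.0, 1.0
    while two_tail_probability(hi, n) > target:
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_tail_probability(mid, n) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For instance, `student_percent_value(5, 10)` is close to 2.228 and `student_percent_value(5, 3)` is close to 3.182.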

For large values of n, the variable t is asymptotically normal (0, 1),
in accordance with the relations

    $\lim_{n \to \infty} S_n(x) = \Phi(x), \qquad \lim_{n \to \infty} s_n(x) = \Phi'(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2},$

which will be proved in 20.2. For small n the t-distribution differs,
however, considerably from the limiting normal distribution, as seen
from Table 4, where the figures for the limiting case are found under
$n = \infty$. A diagram of Student's distribution for n = 3, compared
with the normal curve, is given in Fig. 20. It is evident from the
diagram that the probability of a large deviation from the mean is
considerably greater in the t-distribution than in the normal.

Fig. 20. Student's distribution, frequency curve for n = 3, and the normal
frequency curve, m = 0, $\sigma$ = 1.


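If, instead of the variable t as defined by (18.2.1), we consider the variable

(18.2.6)    $\tau = \frac{\xi_1}{\sqrt{\frac{1}{n} \sum_1^n \xi_\nu^2}}, \qquad (n > 1),$

the numerator and the denominator are no longer independent, and the distribution
cannot be obtained in the same way as before. It is obvious that we always have
$\tau^2 \le n$, so that the fr. f. of $\tau$ is certainly equal to zero outside the interval
$(-\sqrt{n}, \sqrt{n})$. Writing

    $t' = \sqrt{\frac{n-1}{n}}\; \frac{\tau}{\sqrt{1 - \tau^2/n}} = \frac{\xi_1}{\sqrt{\frac{1}{n-1} \sum_2^n \xi_\nu^2}},$

it is seen that $t'$ is given by an expression of the form (18.2.1), with n replaced by
n - 1. Thus $t'$ is distributed in Student's distribution with the d. f. $S_{n-1}(x)$. When
$\tau$ increases from $-\sqrt{n}$ to $+\sqrt{n}$, it is further seen that $t'$ increases steadily from
$-\infty$ to $+\infty$. It follows that the relation $\tau \le x$ is equivalent to the relation

    $t' \le \sqrt{\frac{n-1}{n}}\; \frac{x}{\sqrt{1 - x^2/n}},$

and we have

    $P(\tau \le x) = P\left(t' \le \sqrt{\frac{n-1}{n}}\; \frac{x}{\sqrt{1 - x^2/n}}\right) = S_{n-1}\left(\sqrt{\frac{n-1}{n}}\; \frac{x}{\sqrt{1 - x^2/n}}\right).$

We have thus found the d. f. of the variable $\tau$. Differentiating with respect to x,
we obtain for the fr. f. of $\tau$ the expression

(18.2.7)    $\frac{1}{\sqrt{n\pi}}\, \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)} \left(1 - \frac{x^2}{n}\right)^{\frac{n-3}{2}},$

where $|x| \le \sqrt{n}$. For n = 2, the frequency curve is »U-shaped», i. e. it has a
minimum at the mean x = 0. For n = 3, the fr. f. is constant, and we have a
rectangular distribution (cf. 19.1). For n > 3, the distribution is unimodal and
symmetric about x = 0. The mean of the distribution is 0, and the s. d. is 1 for all
values of n.

18.3. Fisher's z-distribution. Suppose that the m + n random
variables $\xi_1, \ldots, \xi_m, \eta_1, \ldots, \eta_n$ are independent and normal $(0, \sigma)$. Put

    $\xi = \sum_1^m \xi_\mu^2, \qquad \eta = \sum_1^n \eta_\nu^2,$

and consider the variable

(18.3.1)    $\kappa = \frac{\xi}{\eta} = \frac{\sum_1^m \xi_\mu^2}{\sum_1^n \eta_\nu^2}.$

Let $F_{mn}(x)$ denote the d. f. of the variable $\kappa$. Since $\xi$ and $\eta$ are both
non-negative, we have $\kappa \ge 0$, and $F_{mn}(x)$ is equal to zero for x < 0.

For x > 0, we may use the same method as in the preceding paragraph
to find $F_{mn}(x)$. Since by hypothesis $\xi$ and $\eta$ are independent,
$F_{mn}(x)$ is equal to the integral of the product of the fr. f:s of $\xi$ and
$\eta$ over the domain defined by the inequalities $\eta > 0$, $0 < \xi < x\eta$. The
fr. f:s of $\xi$ and $\eta$ may be taken from the table in 18.1, and so we obtain

    $F_{mn}(x) = a_{mn} \iint_{\eta > 0,\; 0 < \xi < x\eta} \xi^{\frac{m}{2} - 1} \eta^{\frac{n}{2} - 1} e^{-\frac{\xi + \eta}{2\sigma^2}}\, d\xi\, d\eta,$

where

    $a_{mn} = \frac{1}{2^{\frac{m+n}{2}}\, \sigma^{m+n}\, \Gamma\left(\frac{m}{2}\right) \Gamma\left(\frac{n}{2}\right)}.$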

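Introducing new variables u, v by the substitution (18.2.2), we find

    $F_{mn}(x) = a_{mn} \int_0^x u^{\frac{m}{2} - 1}\, du \int_0^{\infty} v^{\frac{m+n}{2} - 1} e^{-\frac{(u + 1) v}{2\sigma^2}}\, dv = \frac{\Gamma\left(\frac{m+n}{2}\right)}{\Gamma\left(\frac{m}{2}\right) \Gamma\left(\frac{n}{2}\right)} \int_0^x \frac{u^{\frac{m}{2} - 1}}{(u + 1)^{\frac{m+n}{2}}}\, du.$

Hence we obtain by differentiation the fr. f. $f_{mn}(x) = F_{mn}'(x)$ of the
variable $\kappa$:

(18.3.2)    $f_{mn}(x) = \frac{\Gamma\left(\frac{m+n}{2}\right)}{\Gamma\left(\frac{m}{2}\right) \Gamma\left(\frac{n}{2}\right)} \cdot \frac{x^{\frac{m}{2} - 1}}{(x + 1)^{\frac{m+n}{2}}}, \qquad (x > 0).$

Like the t-distribution, this is independent of $\sigma$. In the particular
case m = 1, the variable $n\kappa$ has an expression of the same form as
the square of the variable t defined by (18.2.1).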
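
The remark on the case m = 1 can be checked pointwise: if t has the fr. f. $s_n$, the variable $t^2/n$ has the fr. f. $s_n(\sqrt{nu})\,\sqrt{n/u}$ for u > 0, and this should coincide with $f_{1n}(u)$. The sketch below, with illustrative function names, evaluates both sides from (18.2.4) and (18.3.2).

```python
import math

def student_density(x, n):
    """Fr. f. s_n(x) of (18.2.4)."""
    c = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return c * (1 + x * x / n) ** (-(n + 1) / 2)

def ratio_density(x, m, n):
    """Fr. f. f_mn(x) of (18.3.2)."""
    if x <= 0:
        return 0.0
    c = math.gamma((m + n) / 2) / (math.gamma(m / 2) * math.gamma(n / 2))
    return c * x ** (m / 2 - 1) / (x + 1) ** ((m + n) / 2)

def density_of_t_squared_over_n(u, n):
    """Fr. f. of t^2 / n at u > 0, by change of variable from s_n."""
    if u <= 0:
        return 0.0
    return student_density(math.sqrt(n * u), n) * math.sqrt(n / u)
```

The two functions `ratio_density(u, 1, n)` and `density_of_t_squared_over_n(u, n)` agree up to rounding for any u > 0.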

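In the analysis of variance introduced by R. A. Fisher (cf. Ch. 36),
we are concerned with a variable z defined by the relation

(18.3.3)    $e^{2z} = \frac{n}{m}\, \kappa = \frac{\frac{1}{m} \sum_1^m \xi_\mu^2}{\frac{1}{n} \sum_1^n \eta_\nu^2}.$

The mean and the variance of the variable $e^{2z}$ are easily found from
the distribution of $\kappa$:

(18.3.4)
    $E(e^{2z}) = \frac{n}{m}\, E(\kappa) = \frac{n}{n - 2}, \qquad (n > 2),$
    $D^2(e^{2z}) = \frac{2 n^2 (m + n - 2)}{m (n - 2)^2 (n - 4)}, \qquad (n > 4).$

For m > 2, the distribution of $e^{2z}$ has a unique mode at the point
$\frac{m - 2}{m} \cdot \frac{n}{n + 2}$.

In order to find the distribution of the variable z itself, we observe
that when $\kappa$ increases from 0 to $\infty$, (18.3.3) shows that z increases
steadily from $-\infty$ to $+\infty$. Thus the relation z < x is
equivalent to $\kappa < \frac{m}{n} e^{2x}$, and the d. f. of z is

    $P(z < x) = P\left(\kappa < \frac{m}{n}\, e^{2x}\right) = F_{mn}\left(\frac{m}{n}\, e^{2x}\right).$

Differentiating with respect to x, we obtain for the fr. f. of z the
expression given by R. A. Fisher (Ref. 13, 94)

(18.3.5)    $2\, m^{\frac{m}{2}}\, n^{\frac{n}{2}}\, \frac{\Gamma\left(\frac{m+n}{2}\right)}{\Gamma\left(\frac{m}{2}\right) \Gamma\left(\frac{n}{2}\right)} \cdot \frac{e^{mx}}{\left(m e^{2x} + n\right)^{\frac{m+n}{2}}}.$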


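18.4. The Beta-distribution. Using the same notations as in
the preceding paragraph, we consider the variable 1)

(18.4.1)    $\lambda = \frac{\kappa}{1 + \kappa} = \frac{\sum_1^m \xi_\mu^2}{\sum_1^m \xi_\mu^2 + \sum_1^n \eta_\nu^2}.$

We obviously have $0 \le \lambda \le 1$, so that the fr. f. of $\lambda$ is zero outside
the interval (0, 1). As $\kappa$ increases from 0 to $\infty$, $\lambda$ increases steadily from
0 to 1. The relation $\lambda < x$ is thus equivalent with $\kappa < \frac{x}{1 - x}$, and
the d. f. of $\lambda$ is

    $P(\lambda < x) = P\left(\kappa < \frac{x}{1 - x}\right) = F_{mn}\left(\frac{x}{1 - x}\right).$

Hence we obtain the fr. f. of $\lambda$:

(18.4.2)    $\frac{1}{(1 - x)^2}\, f_{mn}\left(\frac{x}{1 - x}\right) = \frac{\Gamma\left(\frac{m+n}{2}\right)}{\Gamma\left(\frac{m}{2}\right) \Gamma\left(\frac{n}{2}\right)}\, x^{\frac{m}{2} - 1} (1 - x)^{\frac{n}{2} - 1}, \qquad (0 < x < 1).$

1) In the particular case m = 1, the variable $(n + 1)\lambda$ has an expression of the
same form as the square of the variable $\tau$ defined by (18.2.6).

This is the particular case $p = \frac{m}{2}$, $q = \frac{n}{2}$ of the fr. f. $\beta(x; p, q)$ given
by (12.4.5). In the general case, the distribution defined by the fr. f.

(18.4.3)    $\beta(x; p, q) = \frac{\Gamma(p + q)}{\Gamma(p)\, \Gamma(q)}\, x^{p - 1} (1 - x)^{q - 1}, \qquad (0 < x < 1,\; p > 0,\; q > 0),$

will be called the Beta-distribution. The $\nu$:th moment of this distribution is

(18.4.4)    $\int_0^1 x^\nu \beta(x; p, q)\, dx = \frac{\Gamma(p + q)}{\Gamma(p)} \cdot \frac{\Gamma(p + \nu)}{\Gamma(p + q + \nu)}.$

Hence in particular the mean is $\frac{p}{p + q}$, while the variance is

    $\frac{pq}{(p + q)^2 (p + q + 1)}.$

For p > 1, q > 1 there is a unique mode at the point $x = \frac{p - 1}{p + q - 2}$.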
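
The moment formula (18.4.4) can be evaluated directly with the logarithm of the Gamma function. The sketch below uses illustrative function names and checks that the first two moments reproduce the mean and the variance stated above.

```python
import math

def beta_moment(v, p, q):
    """v-th moment of the Beta-distribution, as in (18.4.4)."""
    log_value = (math.lgamma(p + q) - math.lgamma(p)
                 + math.lgamma(p + v) - math.lgamma(p + q + v))
    return math.exp(log_value)

def beta_mean(p, q):
    return p / (p + q)

def beta_variance(p, q):
    return p * q / ((p + q) ** 2 * (p + q + 1))

def beta_mode(p, q):
    """Mode (p - 1) / (p + q - 2); only meaningful for p > 1, q > 1."""
    if p <= 1 or q <= 1:
        raise ValueError("a unique interior mode requires p > 1 and q > 1")
    return (p - 1) / (p + q - 2)
```

For p = 2.5 and q = 4, the second moment minus the squared first moment equals `beta_variance(2.5, 4)` up to rounding.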


CHAPTER 19. 

Further Continuous Distributions. 

19.1. The rectangular distribution. A random variable $\xi$ will
be said to have a rectangular distribution, if its fr. f. is constantly equal
to $\frac{1}{2h}$ in a certain finite interval $(a - h, a + h)$, and zero outside this
interval. The frequency curve then consists of a rectangle on the
range $(a - h, a + h)$ as base and of height $\frac{1}{2h}$. We shall also say in
this case that $\xi$ is uniformly distributed over $(a - h, a + h)$. The mean
of this distribution is a, and the variance is $\frac{h^2}{3}$.

The error introduced in a numerically calculated quantity by the »rounding off»
may often be considered as uniformly distributed over the range $(-\frac{1}{2}, \frac{1}{2})$, in units of
the last figure.

By a linear transformation of the variable, the range of the distribution
may always be transferred to any given interval. Thus e. g.
the variable $\eta = \frac{\xi - a + h}{2h}$ is uniformly distributed over the interval
(0, 1). The corresponding fr. f. is

    $f_1(x) = 1$ in (0, 1),
    $f_1(x) = 0$ outside (0, 1).


If $\eta_1, \eta_2, \ldots$ are independent variables uniformly distributed over (0, 1),
it is evident that the sum $\eta_1 + \cdots + \eta_n$ is confined to the interval
(0, n). If $f_n(x)$ denotes the fr. f. of $\eta_1 + \cdots + \eta_n$, it thus follows that
$f_n(x)$ is zero outside (0, n). It further follows from (15.12.4) that we
have

    $f_{n+1}(x) = \int_{-\infty}^{\infty} f_1(x - t)\, f_n(t)\, dt = \int_{x-1}^{x} f_n(t)\, dt.$

From this relation, we obtain by easy calculations

    $f_2(x) = x$ for 0 < x < 1,
    $f_2(x) = x - 2(x - 1)$ for 1 < x < 2,

    $f_3(x) = \frac{1}{2} x^2$ for 0 < x < 1,
    $f_3(x) = \frac{1}{2} (x^2 - 3(x - 1)^2)$ for 1 < x < 2,
    $f_3(x) = \frac{1}{2} (x^2 - 3(x - 1)^2 + 3(x - 2)^2)$ for 2 < x < 3.

The general expression, which may be verified by induction, is

    $f_n(x) = \frac{1}{(n - 1)!} \left[ x^{n-1} - \binom{n}{1} (x - 1)^{n-1} + \binom{n}{2} (x - 2)^{n-1} - \cdots \right],$

where 0 < x < n, and the summation is continued as long as the
arguments x, x - 1, x - 2, ... are positive.
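
The general expression for $f_n(x)$ translates directly into a short routine. The following sketch uses hypothetical function names, evaluates the alternating sum, and also forms the fr. f. of the standardized sum from the mean n/2 and the s. d. $\sqrt{n/12}$ of a sum of n variables uniform over (0, 1).

```python
import math

def uniform_sum_density(x, n):
    """Fr. f. f_n(x) of the sum of n independent variables uniform over (0, 1)."""
    if not 0 < x < n:
        return 0.0
    total = 0.0
    k = 0
    # Continue the summation as long as the argument x - k is positive.
    while k <= n and x - k > 0:
        total += (-1) ** k * math.comb(n, k) * (x - k) ** (n - 1)
        k += 1
    return total / math.factorial(n - 1)

def standardized_density(x, n):
    """Fr. f. of the standardized sum (mean n/2, s. d. sqrt(n/12))."""
    s = math.sqrt(n / 12)
    return s * uniform_sum_density(n / 2 + x * s, n)

def normal_density(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
```

For n = 12 the value `standardized_density(0.0, 12)` already lies within 0.01 of `normal_density(0.0)`.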

$f_1$ is a discontinuous frequency function, $f_2$ is continuous but has
a discontinuous derivative, $f_3$ has a continuous derivative but a
discontinuous second derivative, and so on. Diagrams of $f_1$, $f_2$ and $f_3$
are shown in Fig. 21. The mean and the s. d. of the sum $\eta_1 + \cdots + \eta_n$
are $\frac{n}{2}$ and $\sqrt{\frac{n}{12}}$, so that the fr. f. of the standardized sum is

    $\sqrt{\frac{n}{12}}\; f_n\left(\frac{n}{2} + x \sqrt{\frac{n}{12}}\right).$

As n increases, this rapidly approaches the normal frequency function

    $\frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$

Fig. 21. Rectangular and allied distributions.

The expression of $f_2(x)$ given above may be written in the form

    $f_2(x) = 1 - |1 - x|, \qquad (0 < x < 2).$

This fr. f., and any fr. f. obtained from it by a linear transformation,
is sometimes said to define a triangular distribution.

19.2. Cauchy's and Laplace's distributions. In the particular
case n = 1, Student's distribution (18.2.4) has the fr. f.

    $\frac{1}{\pi (1 + x^2)},$

the c. f. of which is, by (10.5.7), equal to $e^{-|t|}$. By a linear
transformation, we obtain the fr. f.

(19.2.1)    $c(x; \lambda, \mu) = \frac{1}{\pi} \cdot \frac{\lambda}{\lambda^2 + (x - \mu)^2},$

with the c. f.

(19.2.2)    $e^{\mu i t - \lambda |t|},$

where $\lambda > 0$. The distribution defined by the fr. f. $c(x; \lambda, \mu)$, or by
the corresponding d. f. $C(x; \lambda, \mu)$, is called Cauchy's distribution. The
distribution is unimodal and symmetric about the point $x = \mu$, which
is the mode and the median of the distribution. No moment of positive
order, not even the mean, is finite. The quartiles (cf. 15.6) are
$\mu \pm \lambda$, so that the semi-interquartile range is equal to $\lambda$.

If a variable $\xi$ is distributed according to (19.2.1), any linear
function $a\xi + b$ has a distribution of the same type, with parameters
$\lambda' = |a| \lambda$ and $\mu' = a\mu + b$.


The form (19.2.2) of the c. f. immediately shows that this distribution
reproduces itself by composition, so that we have the addition
theorem:

(19.2.3)    $C(x; \lambda_1, \mu_1) * C(x; \lambda_2, \mu_2) = C(x; \lambda_1 + \lambda_2, \mu_1 + \mu_2).$

Hence we deduce the following interesting property of the Cauchy
distribution: If $\xi_1, \ldots, \xi_n$ are independent, and all have the same Cauchy
distribution, the arithmetic mean $\bar{\xi} = \frac{1}{n} \sum_1^n \xi_\nu$ has the same distribution as
every $\xi_\nu$.

The two reciprocal Fourier integrals (10.5.6) and (10.5.7) connect
the Cauchy distribution with the Laplace distribution, which has the
fr. f. $\frac{1}{2} e^{-|x|}$. The latter fr. f. has finite moments of every order, while
its derivative is discontinuous at x = 0. By a linear transformation,
we obtain the fr. f.

(19.2.4)    $\frac{1}{2\lambda}\, e^{-\frac{|x - \mu|}{\lambda}},$

with the c. f.

    $\frac{e^{\mu i t}}{1 + \lambda^2 t^2}.$

19.3. Truncated distributions. Suppose that we are concerned
with a random variable $\xi$, attached to the random experiment $\mathfrak{E}$. Let
as usual P and F denote the pr. f. and the d. f. of $\xi$. From a sequence
of repetitions of $\mathfrak{E}$, we select the sub-sequence where the
observed value of $\xi$ belongs to a fixed set $S_0$. The distribution of
$\xi$ in the group of selected cases will then be the conditional distribution
of $\xi$, relative to the hypothesis $\xi \subset S_0$. According to (14.3.1) or
(14.3.2), the conditional probability of the event $\xi \subset S$, where S is
any subset of $S_0$, may be written

    $P(\xi \subset S \mid \xi \subset S_0) = \frac{P(\xi \subset S)}{P(\xi \subset S_0)}.$

The case when $S_0$ is an interval $a < \xi \le b$ often presents itself in the
applications. This means that we discard all observations where the
observed value is $\le a$ or $> b$. The remaining cases then yield a
truncated distribution with the d. f.

    $F(x \mid a < \xi \le b) = 0$ for $x \le a$,
    $F(x \mid a < \xi \le b) = \frac{F(x) - F(a)}{F(b) - F(a)}$ for $a < x \le b$,
    $F(x \mid a < \xi \le b) = 1$ for $x > b$.

If a fr. f. $f(x) = F'(x)$ exists, the truncated distribution has a fr. f.
equal to

    $f(x \mid a < \xi \le b) = \frac{f(x)}{F(b) - F(a)} = \frac{f(x)}{\int_a^b f(t)\, dt}$

for all x in (a, b), and zero outside (a, b). Either a or b may, of
course, be infinite.

1. The truncated normal distribution. Suppose that the stature of an individual
presenting himself for military inscription may be regarded as a random variable
which is normal $(m, \sigma)$. If only those cases are passed where the stature exceeds a
fixed limit $x_0$, the statures of the selected individuals will yield a truncated normal
distribution, with the d. f.

    $\frac{\Phi\left(\frac{x - m}{\sigma}\right) - \Phi\left(\frac{x_0 - m}{\sigma}\right)}{1 - \Phi\left(\frac{x_0 - m}{\sigma}\right)}, \qquad (x > x_0).$

Writing $\lambda = \frac{\Phi'\left(\frac{x_0 - m}{\sigma}\right)}{1 - \Phi\left(\frac{x_0 - m}{\sigma}\right)}$, the two first moments of the truncated distribution are

    $\alpha_1 = m + \lambda\sigma, \qquad \alpha_2 = m^2 + \lambda\sigma (x_0 + m) + \sigma^2.$

If $x_0$, $\alpha_1$ and $\alpha_2$ are given, while m and $\sigma$ are unknown, two equations are thus
available for the determination of the two unknown quantities. Tables for the
numerical solution of these equations have been published by K. Pearson (Ref. 204).
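
The expressions for $\alpha_1$ and $\alpha_2$ can be compared with moments obtained by numerical integration of the truncated fr. f. The following sketch uses hypothetical function names and arbitrary numerical values (m = 170, $\sigma$ = 6, $x_0$ = 165); it does not solve the two equations for m and $\sigma$ and makes no claim about how the published tables were computed.

```python
import math

def normal_density(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def truncated_normal_moments(m, sigma, x0):
    """alpha_1 and alpha_2 of the normal (m, sigma) distribution truncated to x > x0."""
    a = (x0 - m) / sigma
    lam = normal_density(a) / (1 - normal_cdf(a))
    alpha1 = m + lam * sigma
    alpha2 = m * m + lam * sigma * (x0 + m) + sigma * sigma
    return alpha1, alpha2

def numeric_moment(m, sigma, x0, order, width=12.0, steps=100_000):
    """Midpoint-rule moment of the truncated fr. f. f(x) / (1 - F(x0)) on x > x0."""
    upper = max(x0, m) + width * sigma
    h = (upper - x0) / steps
    mass = 1 - normal_cdf((x0 - m) / sigma)
    total = 0.0
    for i in range(steps):
        x = x0 + (i + 0.5) * h
        total += x ** order * normal_density((x - m) / sigma) / sigma
    return total * h / mass
```

With the values above, `truncated_normal_moments(170.0, 6.0, 165.0)` agrees with the two numerical moments to high relative accuracy.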

2. Pareto's distribution. In certain kinds of economic statistics, we often meet
truncated distributions. Thus e. g. in income statistics the data supplied are usually
concerned with the distribution of the incomes of persons whose income exceeds a
certain limit $x_0$ fixed by taxation rules. This distribution, and certain analogous
distributions of property values, sometimes agree approximately with the Pareto
distribution defined by the relation

    $P(\xi > x) = \left(\frac{x_0}{x}\right)^{\alpha}, \qquad (x > x_0,\; \alpha > 0).$

The fr. f. of this distribution is $\frac{\alpha}{x_0} \left(\frac{x_0}{x}\right)^{\alpha + 1}$ for $x > x_0$, and zero for $x \le x_0$. The
mean is finite for $\alpha > 1$, and is then equal to $\frac{\alpha}{\alpha - 1}\, x_0$. The median of the
distribution is $2^{1/\alpha} x_0$. With respect to the Pareto distribution, we refer to some papers by
Hagstroem (Ref. 121, 122).
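
The quantities stated for Pareto's distribution are simple enough to be written down as functions. This is a minimal sketch with illustrative names; it checks that the stated median halves the tail probability and that the fr. f. is the negative derivative of the tail.

```python
def pareto_tail(x, x0, alpha):
    """P(xi > x) for Pareto's distribution with limit x0 and exponent alpha."""
    return 1.0 if x <= x0 else (x0 / x) ** alpha

def pareto_density(x, x0, alpha):
    """Fr. f.: (alpha / x0) (x0 / x)^(alpha + 1) for x > x0, zero otherwise."""
    return 0.0 if x <= x0 else alpha / x0 * (x0 / x) ** (alpha + 1)

def pareto_mean(x0, alpha):
    """Mean alpha x0 / (alpha - 1); finite only for alpha > 1."""
    if alpha <= 1:
        raise ValueError("the mean is finite only for alpha > 1")
    return alpha / (alpha - 1) * x0

def pareto_median(x0, alpha):
    """Median 2^(1/alpha) x0."""
    return 2 ** (1 / alpha) * x0
```

For $x_0 = 2$ and $\alpha = 1.5$ the mean is 6, and the tail probability at the median is one half.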

19.4. The Pearson system. In the majority of the continuous
distributions treated in Chs. 17-19, the frequency function y = f(x)
satisfies a differential equation of the form

(19.4.1)    $y' = \frac{x + a}{b_0 + b_1 x + b_2 x^2}\; y,$

where a and the b:s are constants. It will be easily verified that this
is true e. g. of the normal distribution, the $\chi^2$-distribution, Student's
distribution, the distribution of Fisher's ratio $e^{2z}$, the Beta distribution,
and Pareto's distribution. Any distribution obtained from one
of these by a linear transformation of the random variable will, of
course, satisfy an equation of the same form.
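
As an illustration of the verification mentioned above, take Student's distribution. Differentiating the logarithm of (18.2.4) gives $y'/y = -(n + 1)x/(n + x^2)$, which has the form (19.4.1) with $a = 0$, $b_0 = -n/(n+1)$, $b_1 = 0$, $b_2 = -1/(n+1)$. These constants are derived here for the sketch and are not quoted from the text; the function names are likewise illustrative. The code compares a central-difference derivative of $s_n$ with the right-hand side of (19.4.1).

```python
import math

def student_density(x, n):
    """Fr. f. s_n(x) of (18.2.4)."""
    c = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return c * (1 + x * x / n) ** (-(n + 1) / 2)

def pearson_rhs(x, y, a, b0, b1, b2):
    """Right-hand side of (19.4.1): (x + a) y / (b0 + b1 x + b2 x^2)."""
    return (x + a) / (b0 + b1 * x + b2 * x * x) * y

def student_pearson_constants(n):
    """Constants (a, b0, b1, b2) derived from the log-derivative of s_n."""
    return 0.0, -n / (n + 1), 0.0, -1 / (n + 1)

def central_difference(f, x, h=1e-5):
    """Symmetric difference quotient approximating f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)
```

At any point x, `central_difference(lambda u: student_density(u, n), x)` and `pearson_rhs(x, student_density(x, n), *student_pearson_constants(n))` agree up to the discretization error.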

The differential equation (19.4.1) forms the base of the system of
frequency curves introduced by K. Pearson (Ref. 180, 181, 184 etc.). It can
be shown that the constants of the equation (19.4.1) may be expressed in
terms of the first four moments of the fr. f., if these are finite. The
solutions are classified according to the nature of the roots of the equation
$b_0 + b_1 x + b_2 x^2 = 0$, and in this way a great variety of possible
types of frequency curves y = f(x) are obtained. The knowledge of
the first four moments of any fr. f. belonging to the system is sufficient
to determine the function completely. A full account of the
Pearson types has been given by Elderton (Ref. 12), to which the
reader is referred. Here we shall only mention a few of the most
important types. The multiplicative constant A appearing in all the
equations below should in every case be so determined that the integral
with respect to x over the range indicated becomes equal to unity.


Type I.  $y = A (x - a)^{p-1} (b - x)^{q-1}$;  $a < x < b$;  $p > 0$, $q > 0$.

For a = 0, b = 1 we obtain the Beta distribution (18.4.3) as a particular
case. Taking $p = q = \frac{1}{2} b^2$, $a = -b$, and allowing b to tend to
infinity, we have the normal distribution as a limiting form. Another
limiting form is reached by taking $q = b\alpha$; when $b \to \infty$ we obtain
after changing the notations the following:

Type III.  $y = A (x - \mu)^{\lambda - 1} e^{-\alpha (x - \mu)}$;  $x > \mu$;  $\alpha > 0$, $\lambda > 0$.

This is a generalization of the fr. f. $f(x; \alpha, \lambda)$ defined by (12.3.3), and
thus a fortiori a generalization of the $\chi^2$-distribution (18.1.3).

Type VI.  $y = A (x - a)^{p-1} (x - b)^{q-1}$;  $x > b$;  $a < b$, $q > 0$, $p + q < 1$.

This contains the distribution (18.3.2) as a particular case (a = -1,
b = 0).

Type VII.  $y = \frac{A}{\left((x - a)^2 + b^2\right)^m}$;  $-\infty < x < \infty$;  $m > \frac{1}{2}$.

This contains Student's distribution (18.2.4) as a particular case.

CHAPTER 20. 

Some Convergence Theorems. 

20.1. Convergence of distributions and variables. If we are
given a sequence of random variables $\xi_1, \xi_2, \ldots$ with the d. f:s
$F_1(x), F_2(x), \ldots$, it is often important to know whether the sequence
of d. f:s converges, in the sense of 6.7, to a limiting d. f. F(x). Thus
e. g. the central limit theorem asserts that certain sequences of d. f:s
converge to the normal d. f. $\Phi(x)$. In the next paragraph, we shall
give some further important examples of cases of convergence to the
normal distribution.

It is important to observe that any statement concerning the convergence
of the sequence of d. f:s $\{F_n(x)\}$ should be well distinguished
from a statement concerning the convergence of the sequence of
variables $\{\xi_n\}$. We shall not have occasion to enter in this book upon
a full discussion of the convergence properties of sequences of random
variables. In this respect, the reader may be referred to the books
by Fréchet (Ref. 15) and Lévy (Ref. 25). We shall here only use the
conception of convergence in probability, which will be treated in the
paragraphs 3-6 of the present chapter.


20.2. Convergence of certain distributions to the normal.

1. The Poisson distribution. By 16.5, a variable $\xi$ distributed in
Poisson's distribution has the mean $\lambda$, the s. d. $\sqrt{\lambda}$ and the c. f.
$e^{\lambda (e^{it} - 1)}$. The standardized variable $\frac{\xi - \lambda}{\sqrt{\lambda}}$ thus has the c. f.

    $e^{-it\sqrt{\lambda} + \lambda \left(e^{it/\sqrt{\lambda}} - 1\right)}.$

As $\lambda$ tends to infinity, this tends to $e^{-t^2/2}$, and by the continuity
theorem 10.4 the corresponding d. f. then tends to $\Phi(x)$. Thus $\xi$ is
asymptotically normal $(\lambda, \sqrt{\lambda})$.
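
The approach to the normal d. f. can be observed numerically. The sketch below, with illustrative function names, sums the Poisson probabilities $\frac{\lambda^r}{r!} e^{-\lambda}$ up to $\lambda + x\sqrt{\lambda}$ and compares the result with $\Phi(x)$ for a small and a large value of $\lambda$. The probabilities are formed through logarithms merely to avoid overflow.

```python
import math

def poisson_cdf(k, lam):
    """P(xi <= k) for Poisson's distribution with mean lam."""
    return sum(math.exp(r * math.log(lam) - lam - math.lgamma(r + 1))
               for r in range(k + 1))

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def standardized_poisson_cdf(x, lam):
    """D. f. of (xi - lam) / sqrt(lam) at the point x."""
    k = math.floor(lam + x * math.sqrt(lam))
    return poisson_cdf(k, lam) if k >= 0 else 0.0
```

At x = 1 the difference from $\Phi(1)$ is of the order of 0.05 for $\lambda = 4$, and considerably smaller for $\lambda = 400$.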


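2. The $\chi^2$-distribution. For n degrees of freedom, the variable
$\chi^2$ has by (18.1.6) and (18.1.2) the mean n, the s. d. $\sqrt{2n}$, and the
c. f. $(1 - 2it)^{-n/2}$. Thus the standardized variable $\frac{\chi^2 - n}{\sqrt{2n}}$ has the c. f.

    $e^{-it\sqrt{n/2}} \left(1 - it \sqrt{\frac{2}{n}}\right)^{-n/2},$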


and for every fixed 
written in the form 


( 


we may choose n so large that this may be 


where 丨沙 | S 1 • 

As $n \to \infty$, this evidently tends to $e^{-t^2/2}$, and thus the d. f. of
$\frac{\chi^2 - n}{\sqrt{2n}}$ tends to $\Phi(x)$, so that $\chi^2$ is asymptotically normal $(n, \sqrt{2n})$.

Consider now the probability of the inequality $\sqrt{2\chi^2} < \sqrt{2n} + x$,
which may also be written

    $\chi^2 < n + \left(x + \frac{x^2}{2\sqrt{2n}}\right) \sqrt{2n}.$

As $n \to \infty$ while x remains fixed, $\frac{x^2}{2\sqrt{2n}}$ tends to zero, so that the
probability of the above inequality tends to the same limit as the
probability of the inequality $\chi^2 < n + x\sqrt{2n}$, i. e. to $\Phi(x)$. Thus the
variable $\sqrt{2\chi^2}$ is asymptotically normal $(\sqrt{2n}, 1)$. According to
R. A. Fisher (Ref. 13), the approximation will be improved if
we replace here 2n by 2n - 1, and consider $\sqrt{2\chi^2}$ as normally
distributed with the mean $\sqrt{2n - 1}$ and unit s. d. As soon as $n \ge 30$,
this gives an approximation which is often sufficient for practical
purposes.
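
The two normal approximations may be compared with the exact tail probability. In the sketch below, which uses illustrative function names, the exact value $1 - K_n(x)$ is obtained by Simpson's rule applied to the standard $\chi^2$ fr. f. The example point x = 43.773 is the usual tabulated 5 percent value for n = 30 and is an assumption of the illustration, not a figure taken from the text.

```python
import math

def chi2_density(x, n):
    """Fr. f. k_n(x) of the chi-square distribution with n degrees of freedom."""
    if x <= 0:
        return 0.0
    return x ** (n / 2 - 1) * math.exp(-x / 2) / (2 ** (n / 2) * math.gamma(n / 2))

def chi2_tail(x, n, steps=2000):
    """Exact P(chi^2 > x) = 1 - K_n(x), by Simpson's rule on (0, x); suitable for n > 2."""
    h = x / steps
    total = chi2_density(0.0, n) + chi2_density(x, n)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * chi2_density(i * h, n)
    return 1 - total * h / 3

def normal_tail(x):
    return 0.5 * math.erfc(x / math.sqrt(2))

def crude_tail(x, n):
    """Tail from treating chi^2 as normal (n, sqrt(2n))."""
    return normal_tail((x - n) / math.sqrt(2 * n))

def fisher_tail(x, n):
    """Tail from treating sqrt(2 chi^2) as normal (sqrt(2n - 1), 1)."""
    return normal_tail(math.sqrt(2 * x) - math.sqrt(2 * n - 1))
```

For n = 30 and x = 43.773 the exact tail is close to 0.05, and `fisher_tail` lies nearer to it than `crude_tail`.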

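3. Student's distribution. The fr. f. (18.2.4) of Student's distribution
may be written

(20.2.1)    $s_n(x) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{\frac{n}{2}}\; \Gamma\left(\frac{n}{2}\right)} \cdot \frac{1}{\sqrt{2\pi}} \left(1 + \frac{x^2}{n}\right)^{-\frac{n+1}{2}}.$

By Stirling's formula (12.5.3), the first factor tends to unity as $n \to \infty$,
and for every fixed x we have

    $-\frac{n+1}{2} \log\left(1 + \frac{x^2}{n}\right) \to -\frac{x^2}{2},$

so that

(20.2.2)    $s_n(x) \to \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$

Further, let r denote the greatest integer contained in $\frac{n+1}{2}$. Then
$r \ge \frac{n}{2}$, and thus we have for all $n \ge 1$ and for all real x

    $\left(1 + \frac{x^2}{n}\right)^{\frac{n+1}{2}} \ge \left(1 + \frac{x^2}{n}\right)^{r} \ge 1 + \frac{r x^2}{n} \ge 1 + \frac{x^2}{2}.$

Thus the sequence $\{s_n(x)\}$ is uniformly dominated by a function of
the form $A (1 + \frac{1}{2} x^2)^{-1}$, so that (5.5.2) gives

(20.2.3)    $S_n(x) = \int_{-\infty}^{x} s_n(t)\, dt \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt = \Phi(x).$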


4. The Beta distribution. Let $\xi$ be a variable distributed in the
Beta distribution (18.4.3), with the values np and nq of the parameters.
The mean and the variance of $\xi$ are then, by 18.4,

    $\frac{p}{p + q}$  and  $\frac{pq}{(p + q)^2 (np + nq + 1)}.$

Let now n tend to infinity, while p and q remain fixed. By calculations
similar to those made above, it can then be proved that the fr. f. of the
standardized variable tends to the normal fr. f. $\frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$, and that
the corresponding d. f. tends to the normal d. f. $\Phi(x)$.


20.3. Convergence in probability. Let $\xi_1, \xi_2, \ldots$ be a sequence
of random variables, and let $F_n(x)$ and $\varphi_n(t)$ denote the d. f. and the
c. f. of $\xi_n$. We shall say (cf. Cantelli, Ref. 64, Slutsky, Ref. 214, and
Fréchet, Ref. 112) that $\xi_n$ converges in probability to a constant c if, for
any $\varepsilon > 0$, the probability of the relation $|\xi_n - c| > \varepsilon$ tends to zero as
$n \to \infty$.

Thus if $\nu/n$ denotes the frequency of an event E in a series of n repetitions
of a random experiment $\mathfrak{E}$, Bernoulli's theorem 16.3 asserts that $\nu/n$ converges in
probability to p.

A necessary and sufficient condition for the convergence in probability
of $\xi_n$ to c is obviously that the d. f. $F_n(x)$ tends, for every
fixed $x \ne c$, to the particular d. f. $\varepsilon(x - c)$ defined in 16.1.

By the continuity theorem 10.4, an equivalent condition is that
the c. f. $\varphi_n(t)$ tends for every fixed t to the limit $e^{cit}$.

20.4. Tchebycheff's theorem. We shall prove the following theorem,
which is substantially due to Tchebycheff.

Let $\xi_1, \xi_2, \ldots$ be random variables, and let $m_n$ and $\sigma_n$ denote the
mean and the s. d. of $\xi_n$. If $\sigma_n \to 0$ as $n \to \infty$, then $\xi_n - m_n$ converges
in probability to zero.

In order to prove this theorem, it is sufficient to apply the
Bienaymé-Tchebycheff inequality (15.7.2) to the variable $\xi_n - m_n$. We
then see that the probability of the relation $|\xi_n - m_n| > \varepsilon$ is $\le \frac{\sigma_n^2}{\varepsilon^2}$,
and by hypothesis this tends to zero as $n \to \infty$.

Let us now suppose that the variables $\xi_1, \xi_2, \ldots$ are independent,
and write

    $\bar{\xi} = \frac{1}{n} \sum_1^n \xi_\nu, \qquad \bar{m} = \frac{1}{n} \sum_1^n m_\nu.$

We then have the following corollary of the theorem: If

(20.4.1)    $\sum_1^n \sigma_\nu^2 = o(n^2),$

then $\bar{\xi} - \bar{m}$ converges in probability to zero.

The variable $\bar{\xi}$ has, in fact, the mean $\bar{m}$ and the s. d. $\frac{1}{n} \sqrt{\sum_1^n \sigma_\nu^2}$.
By hypothesis, the latter tends to zero as $n \to \infty$, and thus the truth
of the assertion follows from the above theorem.

In the particular case when the $\xi_\nu$ are the variables considered in
16.6, in connection with a series of independent trials, $\sigma_\nu$ is bounded
and thus (20.4.1) is satisfied. The corollary then reduces to the Poisson
generalization of Bernoulli's theorem.


20.5. Khintchine's theorem. Even if the existence of finite
standard deviations is not assumed for the variables considered in
the preceding paragraph, it may still be possible to obtain a result
corresponding to the corollary of Tchebycheff's theorem. We shall
only consider the case when all the $\xi_\nu$ have the same probability
distribution, and prove the following theorem due to Khintchine
(Ref. 139).

Let $\xi_1, \xi_2, \ldots$ be independent random variables all having the same
d. f. F(x), and suppose that F(x) has a finite mean m. Then the variable
$\bar{\xi} = \frac{1}{n} \sum_1^n \xi_\nu$ converges in probability to m.

If $\varphi(t)$ is the c. f. of the common distribution of the $\xi_\nu$, the c. f.
of the variable $\bar{\xi}$ is $\left(\varphi\left(\frac{t}{n}\right)\right)^n$. According to (10.1.3), we have for $t \to 0$

    $\varphi(t) = 1 + mit + o(t),$

and thus for any fixed t, as $n \to \infty$,

    $\left(\varphi\left(\frac{t}{n}\right)\right)^n = \left(1 + \frac{mit}{n} + o\left(\frac{1}{n}\right)\right)^n \to e^{mit}.$

According to 20.3, this proves the theorem.
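
Khintchine's theorem can be illustrated by simulation with a distribution that has a finite mean but no finite s. d. The sketch below is an illustration under arbitrary choices (function names, seed, sample sizes and $\varepsilon = 0.5$ are all assumptions): it uses Pareto's distribution of 19.3 with $x_0 = 1$ and $\alpha = 1.5$, whose mean is 3, and estimates the probability of the relation $|\bar{\xi} - m| > \varepsilon$ for a small and a large n.

```python
import random

def pareto_sample(rng, x0=1.0, alpha=1.5):
    """One draw with P(xi > x) = (x0 / x)^alpha, by inversion of the tail."""
    u = 1.0 - rng.random()  # uniform over (0, 1], avoids division by zero
    return x0 * u ** (-1.0 / alpha)

def exceed_fraction(n, trials, eps=0.5, x0=1.0, alpha=1.5, seed=1):
    """Share of trials in which the mean of n draws deviates from m by more than eps."""
    rng = random.Random(seed)
    m = alpha / (alpha - 1) * x0
    count = 0
    for _ in range(trials):
        mean = sum(pareto_sample(rng, x0, alpha) for _ in range(n)) / n
        if abs(mean - m) > eps:
            count += 1
    return count / trials
```

Comparing `exceed_fraction(10, 200)` with `exceed_fraction(5000, 200)` shows the estimated probability falling as n grows, although the convergence is slow because the s. d. is infinite.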


20.6. A convergence theorem. The following theorem will be
useful in various applications:

Let $\xi_1, \xi_2, \ldots$ be a sequence of random variables, with the d. f:s
$F_1, F_2, \ldots$. Suppose that $F_n(x)$ tends to a d. f. F(x) as $n \to \infty$.

Let $\eta_1, \eta_2, \ldots$ be another sequence of random variables, and suppose
that $\eta_n$ converges in probability to a constant c. Put

(20.6.1)    $X_n = \xi_n + \eta_n, \qquad Y_n = \xi_n \eta_n, \qquad Z_n = \frac{\xi_n}{\eta_n}.$

Then the d. f. of $X_n$ tends to $F(x - c)$. Further, if c > 0, the d. f. of
$Y_n$ tends to $F\left(\frac{x}{c}\right)$, while the d. f. of $Z_n$ tends to $F(cx)$. (The modification
required when c < 0 is evident.)

It is important to observe that, in this theorem, there is no condition
of independence for any of the variables involved.

It is sufficient to prove one of the assertions of the theorem, as
the other proofs are quite similar. Take, e. g., the case of $Z_n$. Let
x be a continuity point of $F(cx)$, and denote by $P_n$ the joint probability
function of $\xi_n$ and $\eta_n$. We then have to prove that

    $P_n\left(\frac{\xi_n}{\eta_n} \le x\right) \to F(cx)$

as $n \to \infty$. Now the set S of all points in the $(\xi_n, \eta_n)$-plane such that
$\frac{\xi_n}{\eta_n} \le x$ is the sum of two sets $S_1$ and $S_2$ without common points,
defined by the inequalities

    $S_1$:  $\frac{\xi_n}{\eta_n} \le x$,  $|\eta_n - c| \le \varepsilon$,
    $S_2$:  $\frac{\xi_n}{\eta_n} \le x$,  $|\eta_n - c| > \varepsilon$.

Thus we have $P_n(S) = P_n(S_1) + P_n(S_2)$. Here $S_2$ is a subset of the
set $|\eta_n - c| > \varepsilon$, and thus by hypothesis $P_n(S_2) \to 0$ for any $\varepsilon > 0$.

Further, $P_n(S_1)$ is enclosed between the limits

    $P_n\left(\xi_n \le (c \pm \varepsilon) x,\; |\eta_n - c| \le \varepsilon\right).$

Each of these limits differs from the corresponding quantity

    $P_n\left(\xi_n \le (c \pm \varepsilon) x\right) = F_n\left((c \pm \varepsilon) x\right)$

by less than $P_n(|\eta_n - c| > \varepsilon)$. As $n \to \infty$, the latter quantity tends to zero,
and we thus see that $P_n(S)$ is enclosed between two limits, which can
be made to lie as close to $F(cx)$ as we please, by choosing $\varepsilon$ sufficiently
small. Thus our theorem is proved.

Hence we deduce the following proposition due to Slutsky (Ref.
214): If $\xi_n, \eta_n, \ldots, \varrho_n$ are random variables converging in probability
to the constants x, y, ..., r respectively, any rational function
$R(\xi_n, \eta_n, \ldots, \varrho_n)$ converges in probability to the constant R(x, y, ..., r),
provided that the latter is finite. It follows that any power
$R^k(\xi_n, \eta_n, \ldots, \varrho_n)$ with k > 0 converges in probability to $R^k(x, y, \ldots, r)$.

Exercises to Chapters 15-20.

1. The variable $\xi$ has the fr. f. f(x). Find the fr. f:s of the variables $\eta = \frac{1}{\xi}$
and $\zeta = \cos \xi$. Give conditions of existence for the moments of $\eta$ and $\zeta$.

2. For any k > 1, the function $f(x) = \frac{k}{2 (1 + |x|)^{k+1}}$ is a fr. f. with the range
$(-\infty, \infty)$. Show that the n:th moment exists when and only when n < k.

3. The inequality (15.4.6) for the absolute moments is a particular case of
the following inequality due to Liapounoff (Ref. 147). For any non-negative n, p, q
(not necessarily integers), we have

    $\log \beta_{n+p} \le \frac{q}{p + q} \log \beta_n + \frac{p}{p + q} \log \beta_{n+p+q}.$


For n = 0, q = 1, this reduces to (15.4.6), since $\beta_0 = 1$. The general inequality
expresses that a chord joining two points of the curve $y = \log \beta_x$ (x > 0) lies entirely
above the curve, so that $\log \beta_x$ is a convex function of x. (For a detailed proof, see
e. g. Uspensky, Ref. 89, p. 266.)

4. When g(x) is never increasing for x > 0, we have for any k > 0

    $k^2 \int_k^{\infty} g(x)\, dx \le \frac{4}{9} \int_0^{\infty} x^2 g(x)\, dx.$

First prove that the inequality is true in the particular case when g(x) is constant for
0 < x < c, and equal to zero for x > c. Then define a function h(x) which is
constantly equal to g(k) for 0 < x < k + a, and equal to zero for x > k + a, where a is
determined by the condition $a\, g(k) = \int_k^{\infty} g(x)\, dx$, and show that

    $k^2 \int_k^{\infty} g(x)\, dx = k^2 \int_k^{\infty} h(x)\, dx \le \frac{4}{9} \int_0^{\infty} x^2 h(x)\, dx \le \frac{4}{9} \int_0^{\infty} x^2 g(x)\, dx.$

Use this result to prove the inequalities (15.7.3) and (15.7.4).

5. If F(x) is a d. f. with the mean 0 and the s. d. $\sigma$, we have $F(x) \le \frac{\sigma^2}{\sigma^2 + x^2}$
for x < 0, and $F(x) \ge \frac{x^2}{\sigma^2 + x^2}$ for x > 0. For x < 0, this follows from the inequalities

    $-x = \int_{-\infty}^{\infty} (y - x)\, dF \le \int_x^{\infty} (y - x)\, dF,$

    $x^2 \le \left(\int_x^{\infty} (y - x)\, dF\right)^2 \le \int_x^{\infty} dF \cdot \int_x^{\infty} (y - x)^2\, dF \le (1 - F(x)) (\sigma^2 + x^2).$

For x > 0, the proof is similar. Show by an example that these inequalities cannot
be improved.

6. The Bieaaym^-Tchebycheff inequality (16.7.2) may be improved, if some 
central moment ft 2n with n > 1 is known We have, e. g, for i > 1 

P ( 拎 -m\^ka)^ = (jfct _ l)t + y| - 十 2 . 

Apply (15.7.1) with K = l and 1 + ~~~~ • 

7. Use (16.4.6) to show that the semi'inTariaot x n of an arbitrary distribution 
satisfies the inequality |x n | ^ n n /3 n . (Cramer, Ref. 11, p. 27.) 

8. Prove the inequality |a -f 6 | rt ^ 2 n 一 1 (|a + | b | n ). Hence deduce that, if 

the n th moments of x and y exist, so does the n tb moment of x + y 



,• Writing 0(p,q)^ J ( W )P r 9”- r » *how that the first absolute moment 
r>np ' ' 

about the mean of the binomial distribution is 




where fi is the smallest integer > np. For large n, it follows that 

E(\v-np\)(X> J/^ 


10. Show that if 1 — F(r) => 0 («-«*) as x-* 4 - «, and F{x) — 0(e-«l x l) os 
x~~* — oo (c > 0), the distribation is uniquely determined by its moments. 


11. The factorial moments (Steffensen, Ref. 217) of a discrete distribution are $\alpha_{[\nu]} = \sum_r p_r x_r^{[\nu]}$, where $x^{[\nu]}$ denotes the factorial $x(x - 1) \cdots (x - \nu + 1)$. Similarly the central factorial moments are $\mu_{[\nu]} = \sum_r p_r (x_r - m)^{[\nu]}$. Express $\alpha_{[\nu]}$ and $\mu_{[\nu]}$ by means of the ordinary moments. Show that

$$(x + y)^{[\nu]} = x^{[\nu]} + \binom{\nu}{1} x^{[\nu-1]} y^{[1]} + \cdots + y^{[\nu]},$$

and hence deduce relations between $\alpha_{[\nu]}$ and $\mu_{[\nu]}$.


12. The c. f. of the distribution in the preceding exercise is $\varphi(t) = \sum_r p_r e^{itx_r}$. Substituting here t for $e^{it}$, we obtain the generating function $\psi(t) = \sum_r p_r t^{x_r}$. Show that $\psi^{(\nu)}(1) = \alpha_{[\nu]}$, and in particular $E(x) = \psi'(1)$, $D^2(x) = \psi''(1) + \psi'(1) - (\psi'(1))^2$. Use this result to deduce the expressions $\alpha_{[\nu]} = n^{[\nu]} p^{\nu}$ for the binomial distribution, and $\alpha_{[\nu]} = \lambda^{\nu}$ for the Poisson distribution.
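The binomial case of Exercise 12 can be verified directly from the definition of the factorial moments. This is a minimal sketch under stated assumptions: the helper names `falling`, `factorial_moment` and `binomial` are hypothetical, and the distribution is represented as a dictionary from integer mass points to exact probabilities.

```python
from fractions import Fraction
from math import comb


def falling(x, v):
    """Factorial power x^[v] = x (x - 1) ... (x - v + 1); equals 1 for v = 0."""
    out = 1
    for j in range(v):
        out *= x - j
    return out


def factorial_moment(probs, v):
    """alpha_[v] = sum over mass points x_r of p_r * x_r^[v]."""
    return sum((p * falling(x, v) for x, p in probs.items()), Fraction(0))


def binomial(n, p):
    """Binomial probabilities C(n, r) p^r q^(n-r) for r = 0, ..., n."""
    q = 1 - p
    return {r: comb(n, r) * p**r * q**(n - r) for r in range(n + 1)}


if __name__ == "__main__":
    n, p = 6, Fraction(1, 3)
    dist = binomial(n, p)
    for v in range(4):
        # Each line shows alpha_[v] next to n^[v] p^v.
        print(v, factorial_moment(dist, v), falling(n, v) * p**v)
    a1, a2 = factorial_moment(dist, 1), factorial_moment(dist, 2)
    # Variance from psi''(1) + psi'(1) - psi'(1)^2.
    print(a2 + a1 - a1**2)
```

The last line uses the variance expression of the exercise with ψ'(1) and ψ''(1) replaced by the first two factorial moments.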


13. a) We make a series of independent trials, the probability of a »success» being in each trial equal to p = 1 − q, and we go on until we have had an uninterrupted set of ν successes, where ν > 0 is given. Let $p_{n\nu}$ denote the probability that exactly n trials will be required for this purpose. Find the generating function

$$\psi(t) = \sum_{n=1}^{\infty} p_{n\nu}\, t^n = \frac{p^{\nu} t^{\nu} (1 - pt)}{1 - t + p^{\nu} q\, t^{\nu+1}},$$

and show that $E(n) = \psi'(1) = \frac{1 - p^{\nu}}{p^{\nu} q}$.

b) On the other hand, let us make n trials, where n is given, and observe the length μ of the longest uninterrupted set of successes occurring in the course of these n trials. Denoting by $P_{n\nu}$ the probability that μ < ν, show that

$$P_{n\nu} = 1 - p_{1\nu} - p_{2\nu} - \cdots - p_{n\nu},$$

and thus

$$\sum_{n=0}^{\infty} P_{n\nu}\, t^n = \frac{1 - \psi(t)}{1 - t} = \frac{1 - p^{\nu} t^{\nu}}{1 - t + p^{\nu} q\, t^{\nu+1}}.$$

Hence it can be shown (Cramér, Ref. 68) that $P_{n\nu} - e^{-nqp^{\nu}}$ tends to zero as n → ∞, uniformly for 1 ≤ ν ≤ n. It follows that for large n we have

$$E(\mu) = \frac{\log n}{\log \frac{1}{p}} + O(1), \qquad D^2(\mu) = O(1).$$
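For part a) of Exercise 13, the probabilities of first completing a run of ν successes at trial n can be computed by elementary bookkeeping and compared with the power-series coefficients of the rational function $p^{\nu} t^{\nu}(1 - pt)/(1 - t + p^{\nu} q t^{\nu+1})$. The sketch below is illustrative only; the function names are hypothetical, and the series coefficients are obtained from the recurrence implied by multiplying out the denominator.

```python
from fractions import Fraction


def first_run_probs(p, v, n_max):
    """p_{n v} for n = 0..n_max: probability that the first uninterrupted
    run of v successes is completed exactly at trial n."""
    q = 1 - p
    state = [Fraction(0)] * v      # state[s] = P(no run yet, current run length s)
    state[0] = Fraction(1)
    out = [Fraction(0)]            # n = 0 is impossible since v > 0
    for _ in range(n_max):
        out.append(state[v - 1] * p)        # the run is completed at this trial
        new = [Fraction(0)] * v
        new[0] = q * sum(state)             # a failure resets the run length
        for s in range(v - 1):
            new[s + 1] = p * state[s]       # a success extends the run
        state = new
    return out


def series_coeffs(p, v, n_max):
    """Coefficients a_0..a_{n_max} of p^v t^v (1 - p t) / (1 - t + p^v q t^(v+1))."""
    q = 1 - p
    c = p**v * q
    numerator = {v: p**v, v + 1: -(p ** (v + 1))}
    a = []
    for n in range(n_max + 1):
        # From (1 - t + c t^(v+1)) * A(t) = numerator:
        # a_n = numerator_n + a_{n-1} - c * a_{n-v-1}.
        val = numerator.get(n, Fraction(0))
        if n >= 1:
            val += a[n - 1]
        if n >= v + 1:
            val -= c * a[n - v - 1]
        a.append(val)
    return a


def expected_trials(p, v):
    """E(n) = (1 - p^v) / (p^v q)."""
    return (1 - p**v) / (p**v * (1 - p))


if __name__ == "__main__":
    p, v = Fraction(1, 2), 3
    probs = first_run_probs(p, v, 400)
    print(probs == series_coeffs(p, v, 400))
    print(float(sum(n * pn for n, pn in enumerate(probs))), expected_trials(p, v))
```

The quantity $P_{n\nu}$ of part b) is then one minus the partial sum of the first n + 1 entries of `first_run_probs`.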


14. The variable ξ is normal (m, σ). Show that the mean deviation is

$$E(|\xi - m|) = \sigma \sqrt{\frac{2}{\pi}} = 0.79788\,\sigma.$$

15. In both cases of the Central Limit Theorem proved in 17.4, we have

$$E\left( \left| \frac{\xi - m}{\sigma} \right| \right) \to \sqrt{\frac{2}{\pi}}$$

as n → ∞. Use (7.6.9) and (9.6.1). (Cf Ex. 9.)


16. Let $\xi_1, \xi_2, \ldots$ be independent variables, such that $\xi_\nu$ has the possible values 0 and $\pm \nu^{\alpha}$, the respective probabilities being $1 - \nu^{-2\alpha}$, $\frac{1}{2} \nu^{-2\alpha}$, and $\frac{1}{2} \nu^{-2\alpha}$. Thus $\xi_\nu$ has the mean 0 and the s. d. 1. Show that the Liapounoff condition (17.4.8) is satisfied for α < ½, but not for α ≥ ½. Thus for α < ½ the sum $\xi = \sum_{\nu=1}^{n} \xi_\nu$ is asymptotically normal (0, $\sqrt{n}$). For α > ½, the probability that $\xi_2 = \cdots = \xi_n = 0$ does not tend to zero as n → ∞, so that in this case the distribution of ξ does not tend to normality. The last result holds also for α = ½; cf Cramér, Ref. 11, p. 62.


17. If $\alpha_1$ and $\alpha_2$ are the two first moments of the logarithmico-normal distribution (17.5.3), and if η is the real root of the equation $\eta^3 + 3\eta - \gamma_1 = 0$, where $\gamma_1$ is the coefficient of skewness, the parameters a, m and σ of the distribution are given by

$$a = \alpha_1 - \frac{\sqrt{\alpha_2 - \alpha_1^2}}{\eta}, \qquad \sigma^2 = \log (1 + \eta^2),$$

$$m = \log (\alpha_1 - a) - \tfrac{1}{2} \sigma^2.$$
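The parameter formulas of Exercise 17 lend themselves to a round-trip check. The sketch below rests on an assumption that should be stated plainly: it takes the distribution to be that of a variable ξ for which log(ξ − a) is normal (m, σ), and it solves the cubic by Cardano's formula. The function names are hypothetical.

```python
from math import exp, log, sqrt


def real_root_cubic(gamma1):
    """Real root of eta^3 + 3*eta - gamma1 = 0 (unique: the left side is increasing)."""
    d = sqrt(gamma1 * gamma1 / 4.0 + 1.0)

    def cbrt(z):
        # Real cube root, also for negative arguments.
        return z ** (1.0 / 3.0) if z >= 0 else -((-z) ** (1.0 / 3.0))

    # Cardano's formula for the depressed cubic x^3 + 3x - gamma1 = 0.
    return cbrt(gamma1 / 2.0 + d) + cbrt(gamma1 / 2.0 - d)


def lognormal_params(alpha1, alpha2, gamma1):
    """Return (a, m, sigma) from the first two moments and the skewness (gamma1 > 0)."""
    eta = real_root_cubic(gamma1)
    sd = sqrt(alpha2 - alpha1 ** 2)
    a = alpha1 - sd / eta
    sigma2 = log(1.0 + eta * eta)
    m = log(alpha1 - a) - 0.5 * sigma2
    return a, m, sqrt(sigma2)


if __name__ == "__main__":
    # Moments of the assumed model with a = 2, m = 0.5, sigma = 0.4.
    a0, m0, s0 = 2.0, 0.5, 0.4
    w = exp(s0 ** 2)
    alpha1 = a0 + exp(m0 + s0 ** 2 / 2)
    alpha2 = exp(2 * m0 + s0 ** 2) * (w - 1) + alpha1 ** 2
    gamma1 = (w + 2) * sqrt(w - 1)
    print(lognormal_params(alpha1, alpha2, gamma1))
```

In this model η equals the ratio of the s. d. to the distance between the mean and a, which is why a is recovered from the s. d. divided by η.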


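18. Consider the expansion (17.6.3) of a fr. f. f(x) in Gram-Charlier series, and take $f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{x^2}{2\sigma^2}}$. For x = 0, we have $f(0) = \frac{1}{\sigma \sqrt{2\pi}}$, and the expansion becomes

$$\frac{1}{\sigma \sqrt{2\pi}} = \frac{1}{\sqrt{2\pi}} \sum_{\nu=0}^{\infty} \frac{(2\nu)!}{2^{2\nu} (\nu!)^2} (1 - \sigma^2)^{\nu}.$$

This is, however, only correct if $\sigma^2 \le 2$. For $\sigma^2 > 2$, the series is divergent. Find a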
and ft such that af{x) 4 - sr) is the fr. f. of a standardized ▼arUble, and show 

by means of this example that the coefficient ¼ in the convergence condition (17.6.6 a) cannot be replaced by any smaller number.

19. Calculate the coefficients $\gamma_1$ and $\gamma_2$ for the various distributions treated in Ch. 18.




20. If the variable η is uniformly distributed over (a − h, a + h), the c. f. of η is $\frac{\sin ht}{ht} e^{ait}$. If ξ is an arbitrary variable independent of η, with the c. f. φ(t), the sum ξ + η has the c. f. $\frac{\sin ht}{ht} e^{ait} \varphi(t)$. Show that, by the aid of this result, the formula (10.3.3) may be directly deduced from (10.3.1).

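21. Let n be a random variable having a Poisson distribution with the probabilities $\frac{x^{\nu}}{\nu!} e^{-x}$, where ν = 0, 1, . . .. If we consider here the parameter x as a random variable with the fr. f. $\frac{\alpha^{\lambda}}{\Gamma(\lambda)} x^{\lambda-1} e^{-\alpha x}$ (x > 0), the probability that n takes any given value ν is

$$\int_0^\infty \frac{x^{\nu}}{\nu!} e^{-x} \cdot \frac{\alpha^{\lambda}}{\Gamma(\lambda)} x^{\lambda-1} e^{-\alpha x}\,dx = \frac{\Gamma(\lambda + \nu)}{\nu!\,\Gamma(\lambda)} \left( \frac{\alpha}{1 + \alpha} \right)^{\lambda} \left( \frac{1}{1 + \alpha} \right)^{\nu}.$$

Find the c. f., the mean and the s. d. of this distribution, which is known as the negative binomial distribution.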


22. $x_1, x_2, \ldots$ are independent variables having the same distribution with the mean 0 and the s. d. 1. Use the theorems 20.5 and 20.6 to show that the variables

$$y = \sqrt{n}\, \frac{x_1 + \cdots + x_n}{x_1^2 + \cdots + x_n^2} \qquad \text{and} \qquad z = \frac{x_1 + \cdots + x_n}{\sqrt{x_1^2 + \cdots + x_n^2}}$$

are both asymptotically normal (0, 1).


23. If $x_n$ and $y_n$ are asymptotically normal $(a, h/\sqrt{n})$ and $(b, k/\sqrt{n})$ respectively, where b ≠ 0, then the variable $z_n = \sqrt{n}\,(x_n - a)/y_n$ is asymptotically normal (0, h/b). Note that there is no condition of independence in this case.





Chapters 21 — 24. Variables and Distributions in $R_n$.


CHAPTER 21. 


The Two-Dimensional Case. 

21.1. Two simple types of distributions. — Consider two one-dimensional random variables ξ and η. The joint probability distribution (cf 14.2) of ξ and η is a distribution in $R_2$, or a two-dimensional distribution. This case will be treated in the present chapter, before we proceed to the general case of variables and distributions in n dimensions.

According to 8.4, we are at liberty to define the joint distribution of ξ and η by the probability function P(S), which represents the probability of the relation (ξ, η) ⊂ S, or by the distribution function F(x, y) given by the relation

$$F(x, y) = P(\xi \le x,\ \eta \le y).$$

We shall often interpret the probability distribution by means of a distribution of a unit of mass over the (ξ, η)-plane. By projecting the mass in the two-dimensional distribution on one of the coordinate axes, we obtain (cf 8.4) the marginal distribution of the corresponding variable. Denoting by $F_1(x)$ the d. f. of the marginal distribution of ξ, and by $F_2(y)$ the corresponding function for η, we have

$$F_1(x) = P(\xi \le x) = F(x, \infty), \qquad F_2(y) = P(\eta \le y) = F(\infty, y).$$

As in the one-dimensional case (cf 15.2), it will be convenient to introduce here two simple types of distributions: the discrete and the continuous type.

1. The discrete type. A two-dimensional distribution will be said to belong to the discrete type, if the corresponding marginal distributions both belong to the discrete type as defined in 15.2. In each marginal distribution, the total mass is then concentrated in certain discrete mass points, of which at most a finite number are contained in any finite interval. Denote by $x_1, x_2, \ldots$ and by $y_1, y_2, \ldots$ the discrete mass points in the marginal distributions of ξ and η respectively. The total mass in the two-dimensional distribution will then be concentrated in the points of intersection of the straight lines $\xi = x_i$ and $\eta = y_k$, i. e. in the points $(x_i, y_k)$, where i and k independently assume the values 1, 2, 3, . . . If the mass situated in the point $(x_i, y_k)$ is denoted by $p_{ik}$, we have

(21.1.1) $P(\xi = x_i,\ \eta = y_k) = p_{ik}$,

while for every set S not containing any point $(x_i, y_k)$ we have P(S) = 0. Since the total mass in the distribution must be unity, we always have

$$\sum_{i,k} p_{ik} = 1.$$

For certain combinations of indices i, k we may, of course, have $p_{ik} = 0$. The points $(x_i, y_k)$ for which $p_{ik} > 0$ are the discrete mass points of the distribution.

Consider now the marginal distribution of ξ, the discrete mass points of which are $x_1, x_2, \ldots$ If $p_{i\cdot}$ denotes the mass situated in the point $x_i$, we obviously have

(21.1.2) $p_{i\cdot} = P(\xi = x_i) = \sum_k p_{ik}$.

Similarly, in the marginal distribution of η, the point $y_k$ carries the mass $p_{\cdot k}$ given by

(21.1.3) $p_{\cdot k} = P(\eta = y_k) = \sum_i p_{ik}$.

By (15.11.2), a necessary and sufficient condition for the independence of the variables ξ and η is that we have for all i and k

(21.1.4) $p_{ik} = p_{i\cdot}\, p_{\cdot k}$.

2. The continuous type. A two-dimensional distribution will be said to belong to the continuous type, if the d. f. F(x, y) is everywhere continuous, and if the fr. f. (cf 8.4)

$$f(x, y) = \frac{\partial^2 F}{\partial x\, \partial y}$$

exists and is continuous everywhere, except possibly in certain points belonging to a finite number of curves. For any set S we then have

$$P(S) = \iint_S f(x, y)\,dx\,dy,$$

and thus in particular for $S = R_2$

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1.$$

The marginal distribution of the variable ξ has the d. f.

$$P(\xi \le x) = \int_{-\infty}^{x} \int_{-\infty}^{\infty} f(t, u)\,dt\,du = \int_{-\infty}^{x} f_1(t)\,dt,$$

where

(21.1.5) $f_1(x) = \int_{-\infty}^{\infty} f(x, y)\,dy$.

If, at a certain point $x = x_0$, the function f(x, y) is continuous with respect to x for almost all (cf 5.3) values of y and if, in some neighbourhood of $x_0$, we have f(x, y) < G(y), where G(y) is integrable over (−∞, ∞), then it follows from (7.3.1) that $f_1(x)$ is continuous at $x = x_0$. In all cases that will occur in the applications, these conditions are satisfied for all $x_0$, except at most for a finite number of points. In such a case $f_1(x)$ has at most a finite number of discontinuities, so that the marginal distribution of ξ is of the continuous type and has the fr. f. $f_1(x)$. Similarly, we find that the marginal distribution of η has the fr. f.

(21.1.6) $f_2(y) = \int_{-\infty}^{\infty} f(x, y)\,dx$.

By (15.11.3), a necessary and sufficient condition for the independence of the variables ξ and η is that we have for all x and y

(21.1.7) $f(x, y) = f_1(x)\, f_2(y)$.
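For a distribution of the discrete type, the marginal masses (21.1.2) and (21.1.3) are the row and column sums of the table of masses $p_{ik}$, and the independence condition (21.1.4) can be tested entry by entry. The following sketch is an illustration with hypothetical function names; the joint distribution is assumed to be given as a finite rectangular table of exact fractions.

```python
from fractions import Fraction


def marginals(p):
    """Row sums p_i. and column sums p_.k of a joint table p[i][k]."""
    rows = [sum(row) for row in p]
    cols = [sum(col) for col in zip(*p)]
    return rows, cols


def is_independent(p):
    """True when p_ik = p_i. * p_.k holds for every pair of indices."""
    rows, cols = marginals(p)
    return all(p[i][k] == rows[i] * cols[k]
               for i in range(len(rows)) for k in range(len(cols)))


if __name__ == "__main__":
    product_table = [[Fraction(1, 8), Fraction(1, 8)],
                     [Fraction(3, 8), Fraction(3, 8)]]
    diagonal_table = [[Fraction(1, 2), Fraction(0)],
                      [Fraction(0), Fraction(1, 2)]]
    print(marginals(product_table), is_independent(product_table))
    print(marginals(diagonal_table), is_independent(diagonal_table))
```

The two tables have different marginal distributions in the first variable, but the point of the comparison is that the second table concentrates its mass on the diagonal and therefore fails the product condition.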

21.2. Mean values, moments. — The mean value of a function g(ξ, η), integrable over $R_2$ with respect to the two-dimensional pr. f. P(S), has been defined in (15.3.2) by the integral

(21.2.1) $E(g(\xi, \eta)) = \int_{R_2} g(x, y)\,dP(S)$.

For a distribution belonging to one of the two simple types, this reduces to a sum or an ordinary Lebesgue integral, as indicated in 15.3 for the one-dimensional case. The fundamental rules of calculation for mean values have already been deduced in 15.3 for any number of dimensions.

The moments of the distribution (cf 9.2) are the mean values

(21.2.2) $\alpha_{ik} = E(\xi^i \eta^k) = \int_{R_2} x^i y^k\,dP(S)$,

where i and k are non-negative integers. The sum i + k of the indices is the order of the moment $\alpha_{ik}$.

The moments $\alpha_{i0} = E(\xi^i)$ and $\alpha_{0k} = E(\eta^k)$ are identical with the moments of the one-dimensional marginal distributions of ξ and η respectively, as shown by the integral relation (9.2.2). In particular, we put

$$\alpha_{10} = E(\xi) = m_1, \qquad \alpha_{01} = E(\eta) = m_2.$$

The point with the coordinates $\xi = m_1$, $\eta = m_2$ is the centre of gravity of the mass of the two-dimensional distribution. For the moments about the centre of gravity we shall use a particular notation, writing in generalization of (15.4.3)

(21.2.3) $\mu_{ik} = E((\xi - m_1)^i (\eta - m_2)^k)$.

Thus in particular we have $\mu_{10} = \mu_{01} = 0$ and $\mu_{20} = \sigma_1^2$, $\mu_{02} = \sigma_2^2$, where $\sigma_1$ and $\sigma_2$ are the standard deviations of ξ and η.

Between the moments $\alpha_{ik}$ and the central moments $\mu_{ik}$ we have relations analogous to those given in 15.4 for the one-dimensional case. Thus for the second order moments we have

(21.2.4) $\mu_{20} = \alpha_{20} - m_1^2, \qquad \mu_{11} = \alpha_{11} - m_1 m_2, \qquad \mu_{02} = \alpha_{02} - m_2^2$.

$\mu_{11}$ is often called the second order product moment or mixed moment. Further, while $\mu_{20}$ and $\mu_{02}$ are the variances of ξ and η, the product moment $\mu_{11}$ is also called the covariance of ξ and η.

In the particular case when the variables ξ and η are independent, we have by the multiplication theorem (15.3.4) $\alpha_{ik} = \alpha_{i0}\, \alpha_{0k}$ and $\mu_{ik} = \mu_{i0}\, \mu_{0k}$. Thus in particular we have in this case $\mu_{11} = \mu_{10}\, \mu_{01} = 0$.

For any real t and u we have

(21.2.5) $E\left[ (t(\xi - m_1) + u(\eta - m_2))^2 \right] = \mu_{20} t^2 + 2 \mu_{11} t u + \mu_{02} u^2$.

The first member of this identity is the mean value of a square, and is thus non-negative. It follows that the second member is a non-negative quadratic form (cf 11.10) in t and u, so that the moment matrix $M = \begin{Bmatrix} \mu_{20} & \mu_{11} \\ \mu_{11} & \mu_{02} \end{Bmatrix}$ is non-negative, and we have

(21.2.6) $\mu_{11}^2 \le \mu_{20}\, \mu_{02}$.

The rank r of M may (cf 11.6) have one of the values 0, 1 and 2. When r = 2, we have the sign < in (21.2.6), while the sign = holds for r = 1 and r = 0. We shall now show that certain simple properties of the distribution are directly connected with the value of r.

We have r = 0 when and only when the total mass of the distribution is situated in a single point.

We have r = 1 when and only when the total mass of the distribution is situated on a certain straight line, but not in a single point.

We have r = 2 when and only when there is no straight line that contains the total mass of the distribution.

It is obviously sufficient to prove the cases r = 0 and r = 1, as the case r = 2 then follows as a corollary. When r = 0, we have $\mu_{20} = \mu_{02} = 0$, so that the marginal distribution of each variable has its total mass concentrated in one single point (cf 16.1). In the two-dimensional distribution, the whole mass must then be concentrated in the centre of gravity $(m_1, m_2)$. Conversely, if we know that the whole mass of the distribution belongs to one single point, it follows immediately that $\mu_{20} = \mu_{02} = 0$, and hence by (21.2.6) $\mu_{11} = 0$, so that M is of rank zero.

Further, when r = 1, the form (21.2.5) is semi-definite (cf 11.10), and thus takes the value zero for some $t = t_0$ and $u = u_0$ not both equal to zero. This is only possible if the whole mass of the distribution is situated on the straight line

(21.2.7) $t_0 (\xi - m_1) + u_0 (\eta - m_2) = 0$.

Conversely, if it is known that the total mass of the distribution is situated on a straight line, but not in a single point, it is evident that the line must pass through the centre of gravity, and thus have an equation of the form (21.2.7). The mean value in the first member of (21.2.5) then reduces to zero for $t = t_0$, $u = u_0$, so that the quadratic form in the second member is semi-definite, and it follows that M is of rank one. Thus our theorem is proved.

Let us now suppose that we have a distribution such that both variances $\mu_{20}$ and $\mu_{02}$ are positive. (This means i. a. that M is of rank 1 or 2.) We may then define a quantity ρ by writing

(21.2.8) $\rho = \frac{\mu_{11}}{\sqrt{\mu_{20}\, \mu_{02}}} = \frac{\mu_{11}}{\sigma_1 \sigma_2}$.

By (21.2.6) we then have $\rho^2 \le 1$, or −1 ≤ ρ ≤ 1. Further, the case $\rho^2 = 1$ occurs when and only when M is of rank 1, i. e. when the whole mass of the distribution is situated on a straight line. In the particular case when the variables ξ and η are independent, we have $\mu_{11} = 0$ and thus ρ = 0.

The quantity ρ is the correlation coefficient of the variables ξ and η; this will be further dealt with in 21.7.

Suppose that we are given any quantities $m_1$, $m_2$, and any $\mu_{20}$, $\mu_{11}$, $\mu_{02}$ subject to the restriction that the quadratic form $\mu_{20} t^2 + 2 \mu_{11} t u + \mu_{02} u^2$ is non-negative. We can then always find a distribution having $m_1$, $m_2$ for its first order moments and $\mu_{20}$, $\mu_{11}$, $\mu_{02}$ for its second order central moments. The required conditions are, e. g., satisfied by the discrete distribution obtained by placing the mass $\frac{1 + \rho}{4}$ in each of the two points $(m_1 + \sigma_1, m_2 + \sigma_2)$ and $(m_1 - \sigma_1, m_2 - \sigma_2)$, and the mass $\frac{1 - \rho}{4}$ in each of the two points $(m_1 + \sigma_1, m_2 - \sigma_2)$ and $(m_1 - \sigma_1, m_2 + \sigma_2)$. The quantities $\sigma_1$, $\sigma_2$ and ρ are here, of course, defined according to the above expressions.
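The four-point construction can be checked by computing the moments of the resulting discrete distribution. The sketch below is illustrative, with hypothetical function names; it assumes both prescribed variances are positive, so that the correlation coefficient of (21.2.8) is defined.

```python
from math import sqrt


def four_point_distribution(m1, m2, mu20, mu11, mu02):
    """Return [((x, y), mass), ...] with the prescribed first and second order
    central moments. Assumes mu20 > 0, mu02 > 0 and mu11^2 <= mu20 * mu02."""
    s1, s2 = sqrt(mu20), sqrt(mu02)
    rho = mu11 / (s1 * s2)
    if abs(rho) > 1:
        raise ValueError("the quadratic form is not non-negative")
    return [
        ((m1 + s1, m2 + s2), (1 + rho) / 4),
        ((m1 - s1, m2 - s2), (1 + rho) / 4),
        ((m1 + s1, m2 - s2), (1 - rho) / 4),
        ((m1 - s1, m2 + s2), (1 - rho) / 4),
    ]


def moments(dist):
    """Means and second order central moments (m1, m2, mu20, mu11, mu02)."""
    m1 = sum(w * x for (x, _), w in dist)
    m2 = sum(w * y for (_, y), w in dist)
    mu20 = sum(w * (x - m1) ** 2 for (x, _), w in dist)
    mu11 = sum(w * (x - m1) * (y - m2) for (x, y), w in dist)
    mu02 = sum(w * (y - m2) ** 2 for (_, y), w in dist)
    return m1, m2, mu20, mu11, mu02


if __name__ == "__main__":
    target = (1.0, -2.0, 4.0, 1.5, 1.0)
    dist = four_point_distribution(*target)
    print(dist)
    print(moments(dist))
```

When $\rho^2 = 1$ two of the four masses vanish, and the distribution lies on a straight line, in agreement with the rank-one case.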


21.3. Characteristic functions. — The mean value

(21.3.1) $\varphi(t, u) = E(e^{i(t\xi + u\eta)}) = \int_{R_2} e^{i(tx + uy)}\,dP$

is the characteristic function (c. f.) of the two-dimensional random variable (ξ, η), or of the corresponding distribution. We shall also often call φ(t, u) the joint c. f. of the two one-dimensional variables ξ and η.

According to the theory of c. f:s given in Ch. 10, the one-to-one correspondence between one-dimensional distributions and their c. f:s (cf 15.9) extends itself to distributions in any number of dimensions. If two distributions are identical, so are their c. f:s, and conversely.

If the second order moments of the joint distribution of ξ and η are finite, we have in the neighbourhood of the point t = u = 0 the development analogous to (10.1.3)

(21.3.2) $\varphi(t, u) = 1 + \frac{i}{1!} (\alpha_{10} t + \alpha_{01} u) + \frac{i^2}{2!} (\alpha_{20} t^2 + 2 \alpha_{11} t u + \alpha_{02} u^2) + o(t^2 + u^2) = e^{i(m_1 t + m_2 u)} \left[ 1 + \frac{i^2}{2!} (\mu_{20} t^2 + 2 \mu_{11} t u + \mu_{02} u^2) + o(t^2 + u^2) \right]$.

In the particularly important case when the mean values $m_1$ and $m_2$ are both equal to zero, we thus have

(21.3.3) $\varphi(t, u) = 1 - \tfrac{1}{2} (\mu_{20} t^2 + 2 \mu_{11} t u + \mu_{02} u^2) + o(t^2 + u^2)$.

The c. f:s of the marginal distributions of ξ and η are

(21.3.4) $E(e^{it\xi}) = \varphi(t, 0)$, and $E(e^{iu\eta}) = \varphi(0, u)$.

If the variables ξ and η are independent, we have

$$\varphi(t, u) = E(e^{it\xi} e^{iu\eta}) = E(e^{it\xi}) \cdot E(e^{iu\eta}),$$

so that the joint c. f. φ(t, u) is the product of the c. f:s of the marginal distributions corresponding to ξ and η respectively.

Conversely, suppose that it is known that the joint c. f. of ξ and η is of the form $\varphi_1(t) \cdot \varphi_2(u)$. Introducing, if necessary, a multiplicative constant into the factors, we may obviously assume $\varphi_1(0) = \varphi_2(0) = 1$, and then it follows from (21.3.4) that $\varphi_1(t)$ and $\varphi_2(u)$ are the c. f:s of ξ and η respectively. If the two-dimensional interval defined by $a_1 < \xi < b_1$, $a_2 < \eta < b_2$ is a continuity interval (cf 8.3) of the joint distribution of ξ and η, it further follows from the inversion formulae (10.3.1) and (10.6.2) that we have the multiplicative relation

$$P(a_1 < \xi < b_1,\ a_2 < \eta < b_2) = P(a_1 < \xi < b_1) \cdot P(a_2 < \eta < b_2).$$

Allowing here $a_1$ and $a_2$ to tend to −∞, we obtain in particular, using the same notations as in 21.1, $F(x, y) = F_1(x)\, F_2(y)$ for all x and y that are continuity points of $F_1$ and $F_2$ respectively. By the general continuity properties of d. f:s, this relation is immediately extended to all x and y. From (14.4.5) it then follows that the variables ξ and η are independent, and we have thus proved the following theorem:

A necessary and sufficient condition for the independence of two one-dimensional random variables is that their joint c. f. is of the form

(21.3.5) $\varphi(t, u) = \varphi_1(t)\, \varphi_2(u)$.


21.4. Conditional distributions. — The conditional distribution of a random variable η, relative to the hypothesis that another variable ξ belongs to some given set S, has been defined in 14.3. In the present paragraph, we shall consider this question somewhat more closely for distributions of the two simple types introduced in 21.1.

1. The discrete type. Consider the discrete distribution defined by (21.1.1), and let $x_i$ be a value such that the marginal probability $P(\xi = x_i) = \sum_k p_{ik} = p_{i\cdot}$ is positive. The conditional probability of the event $\eta = y_k$, relative to the hypothesis $\xi = x_i$, is then by (14.3.1)

(21.4.1) $P(\eta = y_k \mid \xi = x_i) = \frac{P(\xi = x_i,\ \eta = y_k)}{P(\xi = x_i)} = \frac{p_{ik}}{p_{i\cdot}}$.

For a fixed $x_i$, the conditional probabilities of the various possible values $y_k$ define the conditional distribution of η, relative to the hypothesis $\xi = x_i$. The sum of all these conditional probabilities is, of course, equal to 1.

If the (ξ, η)-distribution is interpreted in the usual way as a distribution of a unit of mass over the points $(x_i, y_k)$, the conditional distribution is obtained by choosing a fixed $x_i$ and multiplying each mass situated on the vertical through the point $\xi = x_i$ by the factor $1/p_{i\cdot}$, so as to make the sum of all the multiplied masses equal to unity.

The conditional mean value of a function g(ξ, η), relative to the hypothesis $\xi = x_i$, is defined as the mean value of $g(x_i, \eta)$ with respect to the conditional distribution of η defined by (21.4.1):

(21.4.2) $E(g(\xi, \eta) \mid \xi = x_i) = \frac{\sum_k p_{ik}\, g(x_i, y_k)}{\sum_k p_{ik}}$.

For g(ξ, η) = η, we obtain the conditional mean of η, which is the ordinate of the centre of gravity of the mass situated on the vertical $\xi = x_i$:

(21.4.3) $E(\eta \mid \xi = x_i) = m_2^{(i)} = \frac{\sum_k p_{ik}\, y_k}{\sum_k p_{ik}}$.

On the other hand, taking $g(\xi, \eta) = (\eta - m_2^{(i)})^2$, we obtain the conditional variance of η.
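The rescaling described in connection with (21.4.1) amounts to dividing one row of the table of masses by its row sum. The sketch below illustrates this together with the conditional mean (21.4.3) and the conditional variance; the function names and the layout of the table (rows indexed by i, columns by k) are assumptions made for the example.

```python
from fractions import Fraction


def conditional_distribution(p, i):
    """P(eta = y_k | xi = x_i) = p_ik / p_i. for row i of the joint table."""
    row_total = sum(p[i])
    if row_total == 0:
        raise ValueError("the conditional distribution requires p_i. > 0")
    return [pik / row_total for pik in p[i]]


def conditional_mean_and_variance(p, y, i):
    """Conditional mean and variance of eta, relative to the hypothesis xi = x_i."""
    cond = conditional_distribution(p, i)
    mean = sum(c * yk for c, yk in zip(cond, y))
    variance = sum(c * (yk - mean) ** 2 for c, yk in zip(cond, y))
    return mean, variance


if __name__ == "__main__":
    y = [0, 1, 2]
    p = [[Fraction(1, 10), Fraction(2, 10), Fraction(1, 10)],
         [Fraction(3, 10), Fraction(0), Fraction(3, 10)]]
    for i in range(len(p)):
        print(conditional_distribution(p, i), conditional_mean_and_variance(p, y, i))
```

In this table both conditional means coincide while the conditional variances differ, so the two variables are not independent even though the conditional mean of η does not depend on the row.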




The conditional distribution of ξ relative to the hypothesis $\eta = y_k$, and the corresponding conditional mean values, are defined by permutation of the variables in the expressions given above.

In the particular case when ξ and η are independent, (21.1.4) shows that we have $p_{ik} = p_{i\cdot}\, p_{\cdot k}$, and this gives us

(21.4.4) $P(\xi = x_i \mid \eta = y_k) = p_{i\cdot} = P(\xi = x_i), \qquad P(\eta = y_k \mid \xi = x_i) = p_{\cdot k} = P(\eta = y_k)$,

in accordance with the general relations (14.4.2) and (14.4.3).

2. The continuous type. Let f(x, y) be the joint fr. f. of the variables ξ and η. Consider an interval (x, x + h) such that the mass situated in the vertical strip x < ξ < x + h, which represents the probability

$$P(x < \xi < x + h) = \int_{x}^{x+h} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy,$$

is positive. The conditional probability of the event η ≤ y, relative to the hypothesis x < ξ < x + h, is then by (14.3.1)

$$P(\eta \le y \mid x < \xi < x + h) = \frac{P(x < \xi < x + h,\ \eta \le y)}{P(x < \xi < x + h)} = \frac{\int_{x}^{x+h} \int_{-\infty}^{y} f(x, y)\,dx\,dy}{\int_{x}^{x+h} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy}.$$

This is the d. f. corresponding to the conditional distribution of η, relative to the hypothesis x < ξ < x + h. It is simply equal to the quantity of mass situated in the strip x < ξ < x + h and below the line η = y, divided by the total mass in the strip. Let now h tend to zero. If the continuity conditions stated in connection with (21.1.5) are satisfied at the point x, and if the marginal fr. f. $f_1(x)$ takes a positive value at the point x, it follows from (5.1.4) that the conditional d. f. tends to the limit

(21.4.5) $\lim_{h \to 0} P(\eta \le y \mid x < \xi < x + h) = \frac{\int_{-\infty}^{y} f(x, \eta)\,d\eta}{\int_{-\infty}^{\infty} f(x, \eta)\,d\eta} = \frac{\int_{-\infty}^{y} f(x, \eta)\,d\eta}{f_1(x)}$.

For fixed x, the limit is evidently a d. f. in y, and this will be called the conditional d. f. of η, relative to the hypothesis ξ = x.

If f(x, y) is continuous in y, the conditional d. f. may be differentiated with respect to y, and we obtain the corresponding conditional fr. f. of η:

(21.4.6) $f(y \mid x) = \frac{f(x, y)}{\int_{-\infty}^{\infty} f(x, y)\,dy} = \frac{f(x, y)}{f_1(x)}$.

The conditional mean value of a function g(ξ, η), relative to the hypothesis ξ = x, is in this case

$$E[g(\xi, \eta) \mid \xi = x] = \int_{-\infty}^{\infty} g(x, y)\, f(y \mid x)\,dy = \frac{\int_{-\infty}^{\infty} g(x, y)\, f(x, y)\,dy}{f_1(x)}.$$

Multiplying by $f_1(x)$ and integrating with respect to x, we obtain

(21.4.7) $E\, g(\xi, \eta) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y)\, f(x, y)\,dx\,dy = \int_{-\infty}^{\infty} E[g(\xi, \eta) \mid \xi = x]\, f_1(x)\,dx$.

The conditional mean and the conditional variance of η are

(21.4.8) $E(\eta \mid \xi = x) = m_2(x) = \frac{\int_{-\infty}^{\infty} y\, f(x, y)\,dy}{\int_{-\infty}^{\infty} f(x, y)\,dy}$,

(21.4.9) $D^2(\eta \mid \xi = x) = \frac{\int_{-\infty}^{\infty} (y - m_2(x))^2 f(x, y)\,dy}{\int_{-\infty}^{\infty} f(x, y)\,dy}$.

The point with the coordinates ξ = x, η = $m_2(x)$ is the limit, for h → 0, of the centre of gravity of the mass in the strip x < ξ < x + h.

The conditional distribution of ξ for a given value of η, and the corresponding conditional mean values, are defined in a similar way. Thus e. g. the conditional fr. f. of ξ, relative to the hypothesis η = y, is

(21.4.10) $f(x \mid y) = \frac{f(x, y)}{\int_{-\infty}^{\infty} f(x, y)\,dx} = \frac{f(x, y)}{f_2(y)} = \frac{f_1(x)\, f(y \mid x)}{f_2(y)}$,

while the conditional mean $E(\xi \mid \eta = y) = m_1(y)$ is the mean of ξ corresponding to the fr. f. f(x | y).

If ξ and η are independent, we have $f(x, y) = f_1(x)\, f_2(y)$. It follows that in this case the conditional fr. f. of either variable is independent of the hypothesis made with respect to the other variable, and is identical with the fr. f. of the corresponding marginal distribution. Accordingly the conditional mean values for both variables agree with the mean values in the marginal distributions:

(21.4.11) $m_1(y) = m_1, \qquad m_2(x) = m_2$.

21.5. Regression, I. — Let ξ and η be random variables with a joint distribution of the continuous type, and suppose that the corresponding fr. f. f(x, y) satisfies the continuity conditions stated in connection with (21.1.5) for every x such that the marginal fr. f. $f_1(x)$ is positive.

According to the preceding paragraph, the conditional fr. f. f(y | x) given by (21.4.6) then represents the distribution of mass in an infinitely narrow vertical strip through the point ξ = x. We may here think of ξ as an independent variable; to a fixed value ξ = x then corresponds a probability distribution of the dependent variable η, with the fr. f. f(y | x).

Consider now some typical value of this conditional η-distribution, such as the mean, the mode, the median etc. Generally this value will depend on x, and may thus be denoted by $y_x$. As x varies, the point $(x, y_x)$ will describe a certain curve. From the shape of this curve we obtain information with respect to the location of the conditional η-distribution for various values of ξ. (Cf Fig. 22 a.)

A curve of this type will be called a regression curve, and will be said to represent the regression of η on ξ. In the sequel we shall always, unless explicitly stated otherwise, choose for $y_x$ the conditional mean $m_2(x)$ of the variable η, as given by (21.4.8), and so obtain the regression curve for the mean of η as the locus of the point $(x, m_2(x))$ when x varies:

(21.5.1) $y = m_2(x) = E(\eta \mid \xi = x)$.

If, instead of ξ, we consider η as our independent variable, the conditional fr. f. of the dependent variable ξ for a fixed value η = y is given by (21.4.10). Any typical value $x_y$ of the conditional distribution of ξ gives rise to a regression curve representing the regression of ξ on η. (Cf Fig. 22 b.) Thus the regression curve for the mean of ξ is the locus of the point $(m_1(y), y)$ when y varies, and has the equation

(21.5.2) $x = m_1(y) = E(\xi \mid \eta = y)$.

Fig. 22. a) Regression of η on ξ. b) Regression of ξ on η.


The two regression curves (21.5.1) and (21.5.2) will in general not coincide. In many important cases occurring in the applications, both regression curves are straight or at least approximately straight lines. Thus e. g. in the particular case when ξ and η are independent, it follows from (21.4.11) that the regression curves are straight lines parallel to the axes and passing through the centre of gravity $(m_1, m_2)$. When a regression curve is a straight line, we shall say that we are concerned with a case of linear regression.

The regression curves (21.5.1) and (21.5.2) possess an important minimum property. Let us try to find, among all possible functions g(ξ) of the single variable ξ, the particular function that gives the best possible representation or estimation of the other variable η. Interpreting the expression »best possible» in the sense of the least squares principle (cf 15.6), we then have to determine g(ξ) so as to render the expression (cf 21.4.7)

(21.5.3) $E[\eta - g(\xi)]^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} [y - g(x)]^2 f(x, y)\,dx\,dy = \int_{-\infty}^{\infty} f_1(x)\,dx \int_{-\infty}^{\infty} [y - g(x)]^2 f(y \mid x)\,dy$

as small as possible. By 15.4 the integral with respect to y in the last expression becomes, however, for every value of x a minimum when g(x) is equal to the conditional mean $m_2(x)$. Thus the minimum of $E[\eta - g(\xi)]^2$, among all possible functions g(ξ), is attained for the function $g(\xi) = m_2(\xi)$, which is graphically represented by the regression curve (21.5.1). Similarly, the expression $E[\xi - h(\eta)]^2$ attains its minimum for the function $h(\eta) = m_1(\eta)$, which corresponds to the regression curve (21.5.2).

Similar definitions may be introduced in the case of a distribution of the discrete type, as given by (21.1.1). For every value $x_i$ of ξ such that the marginal probability $p_{i\cdot}$ is positive, the conditional distribution of η is given by (21.4.1). Let us consider some typical value of this distribution, e. g. the conditional mean $m_2^{(i)}$ given by (21.4.3). When ξ assumes all possible values $x_i$, we thus obtain a sequence of points $(x_i, m_2^{(i)})$ representing the regression of η on ξ. Conversely, the regression of ξ on η is represented by the sequence of points $(m_1^{(k)}, y_k)$, where $m_1^{(k)}$ is the conditional mean of ξ, relative to the hypothesis $\eta = y_k$. In either case, we may connect the points corresponding to consecutive values of i or k by straight lines, and consider the curves thus formed as the regression curves of the discrete distribution.

21.6. Regression, II. — In the literature, we often find the name of regression curves applied also to another type of curves than that introduced in the preceding paragraph. We shall now proceed to a discussion of this other type of curves.

In the minimum problem considered in connection with (21.5.3), we tried to find, among all possible functions g(ξ), one that renders the mean value of the square $(\eta - g(\xi))^2$ as small as possible, and we have seen that the solution of this problem is given by the regression curve (21.5.1). Instead of considering all possible functions g(ξ) we may, however, restrict ourselves to functions belonging to some given class, such as the class of all linear functions, all polynomials of a given degree n, etc. Thus we require to find, among all functions g(ξ) belonging to such a class, one that gives a best possible representation of η according to the principle of least squares. In such a case, the minimum problem may still have a definite solution, but this will generally correspond to a curve different from the regression curve (21.5.1). Curves obtained in this way will be denoted as mean square regression curves, or briefly m. sq. regression curves.¹)

The simplest case is that of the linear m. sq. regression. Here we propose to find the best linear estimate of η by means of ξ, i. e. the linear function g(ξ) = α + βξ that renders the mean value of the square $(\eta - g(\xi))^2$ as small as possible. Now we may write, using the notations introduced in 21.2, and assuming $\mu_{20} > 0$, $\mu_{02} > 0$,

(21.6.1) $E(\eta - \alpha - \beta\xi)^2 = E(\eta - m_2 - \beta(\xi - m_1) + m_2 - \alpha - \beta m_1)^2 = \mu_{20} \beta^2 - 2 \mu_{11} \beta + \mu_{02} + (m_2 - \alpha - \beta m_1)^2$.

¹) When the meaning is clear from the context, we shall often drop the »m. sq.».

An easy calculation shows that the minimum problem has a unique solution given by

(21.6.2) $\beta = \beta_{21} = \frac{\mu_{11}}{\mu_{20}} = \frac{\rho \sigma_2}{\sigma_1}, \qquad \alpha = m_2 - \beta_{21} m_1$,

where ρ is the correlation coefficient defined by (21.2.8). Thus the m. sq. regression line of η has the equation

(21.6.3) $y = m_2 + \beta_{21} (x - m_1)$.

The line passes through $(m_1, m_2)$, and the equation may also be written

(21.6.4) $\frac{y - m_2}{\sigma_2} = \rho\, \frac{x - m_1}{\sigma_1}$.

We note that this line is defined for any distribution such that both variances are finite and positive, and not, as the regression curves of the preceding paragraph, for distributions of the two simple types only.

The quantity $\beta_{21}$ defined by (21.6.2) is the regression coefficient of η on ξ. When the values of α and β given by (21.6.2) are introduced in (21.6.1), the latter expression assumes its minimum value

(21.6.5) $E_{\min} (\eta - \alpha - \beta\xi)^2 = \frac{\mu_{20}\, \mu_{02} - \mu_{11}^2}{\mu_{20}} = \sigma_2^2 (1 - \rho^2)$.

The expression $E(\eta - \alpha - \beta\xi)^2 = \int_{R_2} (y - \alpha - \beta x)^2\,dP$ may be considered as a weighted mean of the square of the vertical distance $|y - \alpha - \beta x|$ between a mass particle dP with the coordinates (x, y) and the straight line y = α + βx. Since this mean becomes a minimum for the regression line (21.6.4), this line may be called the line of closest fit to the mass in the distribution, when distances are measured along the axis of y, and the fit is judged according to the principle of least squares.
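The linear m. sq. regression of η on ξ can be computed from the second order central moments alone. The following sketch is illustrative, with hypothetical function names; it assumes a discrete distribution given as a list of mass points with exact weights, and it checks that the attained mean square equals $\mu_{02} - \mu_{11}^2/\mu_{20}$.

```python
from fractions import Fraction


def central_moments(points):
    """(m1, m2, mu20, mu11, mu02) for a list of ((x, y), mass) entries."""
    m1 = sum(w * x for (x, _), w in points)
    m2 = sum(w * y for (_, y), w in points)
    mu20 = sum(w * (x - m1) ** 2 for (x, _), w in points)
    mu11 = sum(w * (x - m1) * (y - m2) for (x, y), w in points)
    mu02 = sum(w * (y - m2) ** 2 for (_, y), w in points)
    return m1, m2, mu20, mu11, mu02


def ms_regression_line(points):
    """Return (alpha, beta, minimum) for the line y = alpha + beta * x."""
    m1, m2, mu20, mu11, mu02 = central_moments(points)
    if mu20 == 0:
        raise ValueError("the variance of the first variable must be positive")
    beta = mu11 / mu20
    alpha = m2 - beta * m1
    minimum = mu02 - mu11 ** 2 / mu20
    return alpha, beta, minimum


def mean_square_error(points, alpha, beta):
    """Weighted mean of the squared vertical distances to y = alpha + beta * x."""
    return sum(w * (y - alpha - beta * x) ** 2 for (x, y), w in points)


if __name__ == "__main__":
    quarter = Fraction(1, 4)
    points = [((0, 0), quarter), ((1, 1), quarter),
              ((2, 1), quarter), ((3, 3), quarter)]
    alpha, beta, minimum = ms_regression_line(points)
    print(alpha, beta, minimum)
    print(mean_square_error(points, alpha, beta))
```

Any other choice of the two coefficients gives a strictly larger mean square, since the quadratic expression (21.6.1) has a unique minimum when the variance of ξ is positive.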

In the case of a distribution such that the regression curve $y = m_2(x)$ as defined by (21.5.1) exists, the expression $E(\eta - \alpha - \beta\xi)^2$ may be written in the form

$$E(\eta - m_2(\xi))^2 + 2E[(\eta - m_2(\xi))(m_2(\xi) - \alpha - \beta\xi)] + E(m_2(\xi) - \alpha - \beta\xi)^2.$$

By (21.4.7) and (21.4.8) the second term of this expression is, however, equal to zero. Thus we obtain for any α and β

(21.6.6) $E(\eta - \alpha - \beta\xi)^2 = E(\eta - m_2(\xi))^2 + E(m_2(\xi) - \alpha - \beta\xi)^2$.

Here, the first term in the second member is independent of α and β, so that the last term attains its minimum for the same values of α and β as the first member, i. e. for the values given by (21.6.2). Since $|m_2(x) - \alpha - \beta x|$ is the vertical distance between the regression curve $y = m_2(x)$ and the line y = α + βx, it is thus seen that the m. sq. regression line (21.6.4) may also be considered as the line of closest fit to the regression curve $y = m_2(x)$, distances always being measured along the axis of y. It immediately follows that, in a case when the regression curve $y = m_2(x)$ is a straight line, this is identical with the m. sq. regression line (21.6.4).

So far we have been concerned with the linear m. sq. regression of η on ξ. In the converse case of the regression of ξ on η, we have to find the values of α and β that render the expression

(21.6.7) $E(\xi - \alpha - \beta\eta)^2 = \int_{R_2} (x - \alpha - \beta y)^2\,dP$

as small as possible. In the same way as above, we find that the problem has a unique solution, and that the minimizing straight line x = α + βy may be considered as the line of closest fit to the mass in the distribution, or to the regression curve $x = m_1(y)$, when distances are measured horizontally, i. e. along the axis of x. The equation of this line, the m. sq. regression line of ξ, may be written

(21.6.8) $\frac{y - m_2}{\sigma_2} = \frac{1}{\rho} \cdot \frac{x - m_1}{\sigma_1}$,

and the regression coefficient has the expression

(21.6.9) $\beta_{12} = \frac{\mu_{11}}{\mu_{02}} = \frac{\rho \sigma_1}{\sigma_2}$,

while the corresponding minimum value of the expression (21.6.7) is

(21.6.10) $E_{\min} (\xi - \alpha - \beta\eta)^2 = \sigma_1^2 (1 - \rho^2)$.
Fig. 23. M. sq. regression lines. $m_1 = m_2 = 0$, $\sigma_1 = \sigma_2 = 1$. a) $\rho > 0$, b) $\rho < 0$.

Both m. sq. regression lines (21.6.4) and (21.6.8) pass through the
centre of gravity $(m_1, m_2)$. The two lines can never coincide, except
in the extreme cases $\rho = \pm 1$, when the whole mass of the distribution
is situated on a straight line (cf. 21.2). Both regression lines
then coincide with this line.

When $\rho = 0$, the equations of the m. sq. regression lines reduce to
$y = m_2$ and $x = m_1$, so that the lines are then parallel with the axes.
This case occurs e.g. when the variables $\xi$ and $\eta$ are independent
(cf. 21.2 and 21.7).

If the variables are standardized by placing the origin in the
centre of gravity and choosing $\sigma_1$ and $\sigma_2$ as units of measurement for
$\xi$ and $\eta$ respectively, the equations of the m. sq. regression lines
reduce to the simple form $y = \rho x$ and $y = x/\rho$. When $\rho$ is neither
zero nor $\pm 1$, these lines are disposed as shown by Fig. 23 a or 23 b,
according as $\rho > 0$ or $\rho < 0$.

If, instead of measuring the distance between a point and a straight line in the
direction of one of the coordinate axes, we consider the shortest, i.e. the orthogonal
distance, we obtain a new type of regression lines. Let $d$ denote the shortest distance
between the point $(\xi, \eta)$ and a straight line $L$. If $L$ is determined such that $E(d^2)$
becomes as small as possible, we obtain the orthogonal m. sq. regression line. This
is the line of closest fit to the $(\xi, \eta)$-distribution, when distances are measured
orthogonally.

Now $E(d^2)$ may be considered as the moment of inertia of the mass in the distribution
with respect to $L$. For a given direction of $L$, this always attains its minimum
when $L$ passes through the centre of gravity. We may thus write the equation
of $L$ in the form $(\xi - m_1) \sin\varphi - (\eta - m_2) \cos\varphi = 0$, where $\varphi$ is the angle
between $L$ and the positive direction of the $\xi$-axis. The moment of inertia is then

    $E(d^2) = E((\xi - m_1) \sin\varphi - (\eta - m_2) \cos\varphi)^2$
    $\qquad\; = \mu_{20} \sin^2\varphi - 2\mu_{11} \sin\varphi \cos\varphi + \mu_{02} \cos^2\varphi.$

If, on each side of the centre of gravity, we mark on $L$ a segment of length inversely
proportional to $\sqrt{E(d^2)}$, the locus of the end-points when $\varphi$ varies is an ellipse of
inertia of the distribution. The equation of this ellipse is easily found to be

    $\dfrac{(\xi - m_1)^2}{\sigma_1^2} - \dfrac{2\rho(\xi - m_1)(\eta - m_2)}{\sigma_1\sigma_2} + \dfrac{(\eta - m_2)^2}{\sigma_2^2} = c^2.$

For various values of $c$ we obtain a family of homothetic ellipses with the common
centre $(m_1, m_2)$. The directions of the principal axes of this family of ellipses are
obtained from the equation

    $\operatorname{tg} 2\varphi = \dfrac{2\mu_{11}}{\mu_{20} - \mu_{02}},$

and the equations of the axes are

(21.6.11)    $\eta - m_2 = \dfrac{2\mu_{11}}{\mu_{20} - \mu_{02} \pm \sqrt{(\mu_{20} - \mu_{02})^2 + 4\mu_{11}^2}} \, (\xi - m_1).$

Here, the upper sign corresponds to the major axis of the ellipse and thus to the
minimum of $E(d^2)$, i.e. to the orthogonal m. sq. regression line. In the case

    $\mu_{11} = \mu_{20} - \mu_{02} = 0$

the problem is undetermined; in all other cases there is a unique solution.
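
As an illustration, the following sketch compares the slope given by (21.6.11), upper sign, with a brute-force scan of $E(d^2)$ over all directions. The three moment values are arbitrary illustrative numbers, not taken from the text.

```python
import math

mu20, mu11, mu02 = 2.0, 0.8, 1.0   # illustrative second order central moments

def inertia(phi):
    """E(d^2) for the line through the centre of gravity at angle phi."""
    s, c = math.sin(phi), math.cos(phi)
    return mu20 * s * s - 2 * mu11 * s * c + mu02 * c * c

# Slope of the major axis from (21.6.11), upper sign.
d = mu20 - mu02
slope = 2 * mu11 / (d + math.sqrt(d * d + 4 * mu11 * mu11))
phi_star = math.atan(slope)

# Brute-force check: scan directions in [0, pi) and keep the smallest E(d^2).
grid = [k * math.pi / 100000 for k in range(100000)]
phi_grid = min(grid, key=inertia)

print(phi_star, phi_grid)
# Both members of the equation for tg 2*phi:
print(math.tan(2 * phi_star), 2 * mu11 / (mu20 - mu02))
```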


The parabolic m. sq. regression of order $n > 1$ forms a generalization
of the linear m. sq. regression. We here propose to determine a
polynomial $g(\xi) = \beta_0 + \beta_1\xi + \cdots + \beta_n\xi^n$ such that the mean value
$M = E(\eta - g(\xi))^2$ becomes as small as possible. The curve $y = g(x)$ is
then the $n$:th order parabola of closest fit to the mass in the distribution,
or to the regression curve $y = m_2(x)$.

Assuming that all moments appearing in our formulae are finite,
we obtain the conditions for a minimum:

    $\dfrac{1}{2} \dfrac{\partial M}{\partial \beta_\nu} = E[\xi^\nu (g(\xi) - \eta)] = \beta_0 \alpha_{\nu 0} + \beta_1 \alpha_{\nu+1,0} + \cdots + \beta_n \alpha_{\nu+n,0} - \alpha_{\nu 1} = 0$

for $\nu = 0, 1, \ldots, n$. If the moments $\alpha_{ik}$ are known, we thus have
$n + 1$ equations to determine the $n + 1$ unknowns $\beta_0, \ldots, \beta_n$.

The calculations involved in the determination of the unknown
coefficients may be much simplified, if the regression polynomial $g(x)$
is considered as a linear aggregate of the orthogonal polynomials $p_\nu(x)$
associated with the marginal distribution of $\xi$. For all orders such
that these polynomials are uniquely determined (cf. 12.6), we have

(21.6.12)    $E(p_m(\xi) p_n(\xi)) = \int_{-\infty}^{\infty} p_m(x) p_n(x) \, dF_1(x) = \begin{cases} 1 & \text{for } m = n, \\ 0 & \text{for } m \ne n, \end{cases}$

where $p_n(x)$ is of the $n$:th degree, and $F_1(x)$ denotes the marginal
d. f. of $\xi$. Any polynomial $g(x)$ of degree $n$ may be written in the
form

    $g(x) = c_0 p_0(x) + \cdots + c_n p_n(x)$

with constant coefficients $c_0, \ldots, c_n$. The conditions for a minimum
now become

(21.6.13)    $\dfrac{1}{2} \dfrac{\partial M}{\partial c_\nu} = E[p_\nu(\xi)(g(\xi) - \eta)] = c_\nu - E(\eta \, p_\nu(\xi)) = 0.$

Hence we obtain $c_\nu = E(\eta \, p_\nu(\xi))$, so that the coefficients are obtained
directly, without first having to solve a system of linear
equations. It is further seen that the expression for $c_\nu$ is independent
of the degree $n$. Thus if we know e.g. the regression polynomial
of degree $n$, and require the corresponding polynomial of degree $n + 1$,
it is only necessary to calculate the additional term $c_{n+1} p_{n+1}(x)$.
Introducing the expressions of the $c_\nu$ into the mean value $M$, we find
for the minimum value of $M$

(21.6.14)    $E_{\min}(\eta - g(\xi))^2 = E(\eta^2) - c_0^2 - \cdots - c_n^2.$

It should finally be observed that it is by no means essential for
the validity of the above relations that the $p_\nu(x)$ are polynomials.
Any sequence of functions satisfying the orthogonality conditions
(21.6.12) may be used to form a m. sq. regression curve $y = g(x)
= \sum c_\nu p_\nu(x)$, and the relations (21.6.13) and (21.6.14) then hold true
irrespective of the form of the $p_\nu(x)$.
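
The orthogonal-polynomial route can be sketched for a discrete distribution as follows. The data are hypothetical, and the polynomials are built by Gram-Schmidt orthogonalization of $1, x, x^2, \ldots$, which is one standard construction assumed here rather than a method prescribed by the text.

```python
import math

# Hypothetical discrete joint distribution: mass p[i] at the point (x[i], y[i]).
x = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [3.5, 1.2, 0.1, 0.8, 4.1, 9.3]
p = [1 / 6] * 6

def E(values):
    """Mean value of per-point values under the distribution p."""
    return sum(pi * v for pi, v in zip(p, values))

def orthonormal_polys(n):
    """Values at the mass points of p_0, ..., p_n satisfying (21.6.12).

    Built by Gram-Schmidt orthogonalization of 1, x, ..., x^n.
    """
    basis = []
    for k in range(n + 1):
        q = [xi ** k for xi in x]
        for b in basis:
            proj = E([qi * bi for qi, bi in zip(q, b)])
            q = [qi - proj * bi for qi, bi in zip(q, b)]
        norm = math.sqrt(E([qi * qi for qi in q]))
        basis.append([qi / norm for qi in q])
    return basis

def regression(n):
    """Coefficients c_v = E(eta p_v(xi)) and fitted values g(x_i) for degree n."""
    polys = orthonormal_polys(n)
    c = [E([yi * bi for yi, bi in zip(y, b)]) for b in polys]
    g = [sum(cv * b[i] for cv, b in zip(c, polys)) for i in range(len(x))]
    return c, g

c2, g2 = regression(2)
c3, g3 = regression(3)
residual2 = E([(yi - gi) ** 2 for yi, gi in zip(y, g2)])
residual3 = E([(yi - gi) ** 2 for yi, gi in zip(y, g3)])
# The two members of (21.6.14) for n = 2:
print(residual2, E([yi * yi for yi in y]) - sum(cv * cv for cv in c2))
# Raising the degree only appends one coefficient:
print(c2, c3[:3])
```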


21.7. The correlation coefficient. According to (21.2.8), the correlation
coefficient $\rho$ of $\xi$ and $\eta$ is defined by the expression

    $\rho = \dfrac{\mu_{11}}{\sigma_1\sigma_2} = \dfrac{E[(\xi - m_1)(\eta - m_2)]}{\sqrt{E(\xi - m_1)^2 \, E(\eta - m_2)^2}},$

and we have seen in 21.2 that we always have $-1 \le \rho \le 1$. The
correlation coefficient is an important characteristic of the $(\xi, \eta)$-distribution.
Its main properties are intimately connected with the two
m. sq. regression lines

(21.7.1)    $\dfrac{y - m_2}{\sigma_2} = \rho \, \dfrac{x - m_1}{\sigma_1},$
            $\dfrac{y - m_2}{\sigma_2} = \dfrac{1}{\rho} \, \dfrac{x - m_1}{\sigma_1},$

which are the straight lines of closest fit to the mass in the $(\xi, \eta)$-distribution,
in the sense defined in the preceding paragraph. The
closeness of fit realized by these lines is measured by the expressions

(21.7.2)    $E_{\min}(\eta - \alpha - \beta\xi)^2 = \sigma_2^2 (1 - \rho^2),$
            $E_{\min}(\xi - \alpha - \beta\eta)^2 = \sigma_1^2 (1 - \rho^2),$

respectively. Thus either variable has its variance reduced in the
proportion $(1 - \rho^2) : 1$ by the subtraction of its best linear estimate in
terms of the other variable. These expressions are sometimes called
the residual variances of $\eta$ and $\xi$ respectively.

When $\rho = 0$, no part of the variance of $\eta$ can thus be removed
by the subtraction of a linear function of $\xi$, and vice versa. In this
case, we shall say that the variables are uncorrelated.

When $\rho \ne 0$, a certain fraction of the variance of $\eta$ may be removed
by the subtraction of a linear function of $\xi$, and vice versa.
The maximum amount of the reduction increases according to (21.7.2)
in the same measure as $\rho$ differs from zero. In this case, we shall
say that the variables are correlated, and that the correlation is positive
or negative according as $\rho > 0$ or $\rho < 0$.

When $\rho$ reaches one of its extreme values $\pm 1$, (21.7.2) shows that
the residual variances are zero. We have shown in 21.2 that this case
occurs when and only when the total mass of the $(\xi, \eta)$-distribution
is situated on a straight line, which is then identical with both
regression lines (21.7.1). In this extreme case, there is complete functional
dependence between the variables: when $\xi$ is known, there is
only one possible value for $\eta$, and conversely. Either variable is a
linear function of the other, and the two variables vary in the same
sense, or in inverse senses, according as $\rho = +1$ or $\rho = -1$.

On account of these properties, the correlation coefficient $\rho$ may
be regarded as a measure of the degree of linearity shown by the
$(\xi, \eta)$-distribution. This degree reaches its maximum when $\rho = \pm 1$
and the whole mass of the distribution is situated on a straight
line. The opposite case occurs when $\rho = 0$ and no reduction of the
variance of either variable can be effected by the subtraction of a
linear function of the other variable.

It has been shown in 21.2 that in the particular case when $\xi$ and
$\eta$ are independent we have $\rho = 0$. Thus two independent variables are
always uncorrelated. It is most important to observe that the converse
is not true. Two uncorrelated variables are not necessarily independent.




Consider, in fact, a one-dimensional fr. f. $g(x)$ which differs from
zero only when $x > 0$, and has a finite second moment. Then

    $f(x, y) = \dfrac{g(\sqrt{x^2 + y^2})}{2\pi\sqrt{x^2 + y^2}}$

is the fr. f. of a two-dimensional distribution, where the density of
the mass is constant on every circle $x^2 + y^2 = c^2$. The centre of gravity
is $m_1 = m_2 = 0$, and on account of the symmetry of the distribution
we have $\mu_{11} = 0$, and hence $\rho = 0$. Thus two variables with this
distribution are uncorrelated. However, in order that the variables
should be independent, it is by (15.11.3) necessary and sufficient that
$f(x, y)$ should be of the form $f_1(x) f_2(y)$, and this condition is not
always satisfied, as will be seen e.g. by taking $g(x) = e^{-x}$.
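
The same phenomenon already appears in a very small discrete case. The sketch below is not the continuous example of the text: it assumes equal masses at four points of the unit circle, chosen only because the arithmetic is exact.

```python
# Equal masses at four points of the unit circle: a discrete stand-in
# (an assumption made for illustration) for a rotationally symmetric mass.
pts = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
p = [0.25] * 4

m1 = sum(pi * x for pi, (x, y) in zip(p, pts))
m2 = sum(pi * y for pi, (x, y) in zip(p, pts))
mu11 = sum(pi * (x - m1) * (y - m2) for pi, (x, y) in zip(p, pts))

# Independence would require P(xi = 1, eta = 0) = P(xi = 1) * P(eta = 0).
joint = sum(pi for pi, (x, y) in zip(p, pts) if x == 1.0 and y == 0.0)
p_x = sum(pi for pi, (x, y) in zip(p, pts) if x == 1.0)
p_y = sum(pi for pi, (x, y) in zip(p, pts) if y == 0.0)
print(mu11, joint, p_x * p_y)
```

The covariance vanishes, so the variables are uncorrelated, while the joint probability differs from the product of the marginal probabilities, so they are not independent.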

If $\rho$ is the correlation coefficient of $\xi$ and $\eta$, it follows directly
from the definition that the variables $\xi' = a\xi + b$ and $\eta' = c\eta + d$
have the correlation coefficient $\rho' = \rho \, \operatorname{sgn}(ac)$, where $\operatorname{sgn} x$ stands for
$\pm 1$, according as $x$ is positive or negative.

In the particular case of a discrete distribution with only two
possible values ($x_1$, $x_2$ and $y_1$, $y_2$ respectively) for each variable, we
find after some reductions, using the notations of 21.1,

(21.7.3)    $\rho = \dfrac{p_{11} p_{22} - p_{12} p_{21}}{\sqrt{p_{1.} \, p_{2.} \, p_{.1} \, p_{.2}}} \, \operatorname{sgn}[(x_1 - x_2)(y_1 - y_2)].$
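
A quick numerical check of (21.7.3) against the definition of the correlation coefficient, for a hypothetical 2x2 table (all probabilities and values below are invented for illustration):

```python
import math

# Hypothetical 2x2 distribution: P[i][k] = P(xi = xs[i], eta = ys[k]).
P = [[0.3, 0.2],
     [0.1, 0.4]]
xs = [0.0, 1.0]
ys = [5.0, 2.0]

row = [sum(P[i]) for i in range(2)]            # p_1., p_2.
col = [P[0][k] + P[1][k] for k in range(2)]    # p_.1, p_.2

# Formula (21.7.3).
sign = math.copysign(1.0, (xs[0] - xs[1]) * (ys[0] - ys[1]))
rho_formula = sign * (P[0][0] * P[1][1] - P[0][1] * P[1][0]) / math.sqrt(
    row[0] * row[1] * col[0] * col[1])

# Direct computation from the definition of the correlation coefficient.
m1 = sum(row[i] * xs[i] for i in range(2))
m2 = sum(col[k] * ys[k] for k in range(2))
mu11 = sum(P[i][k] * (xs[i] - m1) * (ys[k] - m2)
           for i in range(2) for k in range(2))
var1 = sum(row[i] * (xs[i] - m1) ** 2 for i in range(2))
var2 = sum(col[k] * (ys[k] - m2) ** 2 for k in range(2))
rho_direct = mu11 / math.sqrt(var1 * var2)
print(rho_formula, rho_direct)
```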


21.8. Linear transformation of variables. Consider a linear
transformation of the random variables $\xi$ and $\eta$, corresponding to a
rotation of axes about the centre of gravity. We then introduce new
variables $X$ and $Y$ defined by

(21.8.1)    $X = (\xi - m_1) \cos\varphi + (\eta - m_2) \sin\varphi,$
            $Y = -(\xi - m_1) \sin\varphi + (\eta - m_2) \cos\varphi,$

and conversely

(21.8.2)    $\xi = m_1 + X \cos\varphi - Y \sin\varphi,$
            $\eta = m_2 + X \sin\varphi + Y \cos\varphi.$

If the angle of rotation $\varphi$ is determined by the equation
$\operatorname{tg} 2\varphi = \dfrac{2\mu_{11}}{\mu_{20} - \mu_{02}}$, we find

    $E(XY) = \mu_{11} \cos 2\varphi - \tfrac{1}{2}(\mu_{20} - \mu_{02}) \sin 2\varphi = 0,$

so that $X$ and $Y$ are uncorrelated. In the particular case $\mu_{11} = \mu_{20} - \mu_{02}
= 0$, when the equation for $\varphi$ is undetermined, we have $E(XY)
= 0$ for any $\varphi$. Thus it is always possible to express $\xi$ and $\eta$ as linear
functions of two uncorrelated variables.
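
The rotation can be verified directly from the second order moments. In the sketch below the three moment values are illustrative, and `atan2` is used for the angle only so that the case $\mu_{20} = \mu_{02}$ causes no division by zero.

```python
import math

mu20, mu11, mu02 = 2.0, 0.8, 1.0   # illustrative second order central moments

# Angle of rotation from tg 2*phi = 2*mu11 / (mu20 - mu02).
phi = 0.5 * math.atan2(2 * mu11, mu20 - mu02)
c, s = math.cos(phi), math.sin(phi)

# Second order moments of X and Y of (21.8.1), expanded term by term.
e_xx = mu20 * c * c + 2 * mu11 * s * c + mu02 * s * s
e_yy = mu20 * s * s - 2 * mu11 * s * c + mu02 * c * c
e_xy = mu11 * (c * c - s * s) - (mu20 - mu02) * s * c

print(e_xy)                        # vanishes up to rounding error
print(e_xx + e_yy, mu20 + mu02)    # the total variance is unchanged by the rotation
```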

Consider in particular the case when the moment matrix
$M = \begin{pmatrix} \mu_{20} & \mu_{11} \\ \mu_{11} & \mu_{02} \end{pmatrix}$ is of rank 1 (cf. 21.2). We then have $\rho = \pm 1$, and the
whole mass of the distribution is situated on the line
$\eta - m_2 = \dfrac{\rho\sigma_2}{\sigma_1} (\xi - m_1)$. Let us now determine the angle of rotation $\varphi$ from the
equation $\operatorname{tg} \varphi = \dfrac{\rho\sigma_2}{\sigma_1}$. From (21.8.1) we then find

    $E(Y^2) = \sigma_1^2 \sin^2\varphi - 2\rho\sigma_1\sigma_2 \sin\varphi \cos\varphi + \sigma_2^2 \cos^2\varphi$
    $\qquad\; = (\sigma_1 \sin\varphi - \rho\sigma_2 \cos\varphi)^2 = 0.$

Thus the variance of $Y$ is equal to zero, so that $Y$ is a variable
which is almost always equal to zero (cf. 16.1). If we then put $Y = 0$
in (21.8.2), the resulting equations between $\xi$, $\eta$ and $X$ will be satisfied
with a probability equal to 1. Thus two variables $\xi$ and $\eta$ with
a moment matrix $M$ of rank 1 may, with a probability equal to 1, be
expressed as linear functions of one single variable.

21.9. The correlation ratio and the mean square contingency.
Consider two variables $\xi$ and $\eta$ with a distribution of the continuous
type, such that the conditional mean $m_2(x)$ is a continuous function
of $x$. In the relation (21.6.6) we put $\alpha = m_2$, $\beta = 0$, and so obtain

(21.9.1)    $\sigma_2^2 = E(\eta - m_2)^2 = E(\eta - m_2(\xi))^2 + E(m_2(\xi) - m_2)^2.$

We thus see that the variance of $\eta$ may be represented as the sum
of two components, viz. the mean square deviation of $\eta$ from its conditional
mean $m_2(\xi)$, and the mean square deviation of $m_2(\xi)$ from its
mean $m_2$.

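We now define a quantity $\theta_{\eta\xi}$ by putting

(21.9.2)    $\theta_{\eta\xi}^2 = \dfrac{1}{\sigma_2^2} E(m_2(\xi) - m_2)^2 = \dfrac{1}{\sigma_2^2} \int_{-\infty}^{\infty} (m_2(x) - m_2)^2 f_1(x) \, dx.$

$\theta_{\eta\xi}$ is the correlation ratio *) of $\eta$ on $\xi$ introduced by K. Pearson. In
the applications we are usually concerned with the square $\theta^2$, and we
may thus leave the sign of $\theta$ undetermined. From (21.9.1) we obtain

(21.9.3)    $1 - \theta_{\eta\xi}^2 = \dfrac{1}{\sigma_2^2} E(\eta - m_2(\xi))^2,$

and hence

(21.9.4)    $0 \le \theta_{\eta\xi}^2 \le 1.$

We further write the equation of the first m. sq. regression line
(21.7.1) in the form $y = \alpha + \beta x$, and insert these values of $\alpha$ and $\beta$
in (21.6.6). Using (21.7.2) and (21.9.3), we then obtain after reduction

(21.9.5)    $\theta_{\eta\xi}^2 = \rho^2 + \dfrac{1}{\sigma_2^2} E(m_2(\xi) - \alpha - \beta\xi)^2.$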

It follows that $\theta_{\eta\xi}^2 = 0$ when and only when $m_2(x)$ is independent
of $x$. In fact, when $m_2(x)$ is constant, the regression curve $y = m_2(x)$
is a horizontal straight line, which implies $\rho = \beta = 0$, and consequently
$\theta_{\eta\xi}^2 = 0$. The converse is shown in a similar way. Further, (21.9.3)
shows that $\theta_{\eta\xi}^2 = 1$ when and only when the whole mass of the
distribution is situated on the regression curve $y = m_2(x)$, so that
there is complete functional dependence between the variables. For
intermediate values of $\theta_{\eta\xi}^2$, (21.9.3) shows that the correlation ratio
may be considered as a measure of the tendency of the mass to accumulate
about the regression curve.

When the regression of $\eta$ on $\xi$ is linear, so that $y = m_2(x)$ is a
straight line, (21.9.5) shows that we have $\theta_{\eta\xi}^2 = \rho^2$, and (21.9.3) reduces
to the first relation (21.7.2). In such a case, the calculation of the
correlation ratio does not give us any new information, if we already
know the correlation coefficient $\rho$.

In a case of non-linear regression, on the other hand, $\theta_{\eta\xi}^2$ always
exceeds $\rho^2$ by a quantity which measures the deviation of the curve
$y = m_2(x)$ from the straight line of closest fit.

The correlation ratio $\theta_{\xi\eta}$ of $\xi$ on $\eta$ is, of course, defined by interchanging
the variables in the above relations. The curve $y = m_2(x)$
is then replaced by the curve $x = m_1(y)$.

For a distribution of the discrete type, the correlation ratio may
be similarly defined, replacing (21.9.2) and (21.9.3) by

(21.9.2 a)    $\theta_{\eta\xi}^2 = \dfrac{1}{\sigma_2^2} E(m_2(\xi) - m_2)^2 = \dfrac{1}{\sigma_2^2} \sum_i p_{i.} (m_2^{(i)} - m_2)^2,$

(21.9.3 a)    $1 - \theta_{\eta\xi}^2 = \dfrac{1}{\sigma_2^2} E(\eta - m_2(\xi))^2 = \dfrac{1}{\sigma_2^2} \sum_{i,k} p_{ik} (y_k - m_2^{(i)})^2,$

where $p_{i.}$ and $m_2^{(i)}$ are defined by (21.1.2) and (21.4.3) respectively.
The relations (21.9.4), (21.9.5) and the above conclusions concerning
the properties of the correlation ratio hold true with obvious modifications
in this case.

*) In the literature, the correlation ratio is usually denoted by the letter $\eta$, which
obviously cannot be used here, since $\eta$ is a random variable.
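
For a discrete distribution the two sums can be evaluated directly. The sketch below uses a hypothetical 3x3 table whose regression of $\eta$ on $\xi$ is clearly non-linear, and checks that the two expressions add up to 1 and that $\theta_{\eta\xi}^2$ is not smaller than $\rho^2$.

```python
# Hypothetical discrete distribution: P[i][k] = P(xi = xs[i], eta = ys[k]).
xs = [-1.0, 0.0, 1.0]
ys = [0.0, 1.0, 2.0]
P = [[0.05, 0.10, 0.20],
     [0.20, 0.10, 0.02],
     [0.03, 0.10, 0.20]]
I, K = range(3), range(3)

row = [sum(P[i]) for i in I]                       # p_i.
col = [sum(P[i][k] for i in I) for k in K]         # p_.k
m1 = sum(row[i] * xs[i] for i in I)
m2 = sum(col[k] * ys[k] for k in K)
var1 = sum(row[i] * (xs[i] - m1) ** 2 for i in I)
var2 = sum(col[k] * (ys[k] - m2) ** 2 for k in K)

# Conditional means m_2^(i) of eta, given xi = xs[i].
cond_mean = [sum(P[i][k] * ys[k] for k in K) / row[i] for i in I]

# Right-hand sides of (21.9.2 a) and (21.9.3 a).
theta2 = sum(row[i] * (cond_mean[i] - m2) ** 2 for i in I) / var2
within = sum(P[i][k] * (ys[k] - cond_mean[i]) ** 2 for i in I for k in K) / var2

mu11 = sum(P[i][k] * (xs[i] - m1) * (ys[k] - m2) for i in I for k in K)
rho2 = mu11 ** 2 / (var1 * var2)
print(theta2, 1 - within, rho2)
```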

The correlation coefficient and the correlation ratio both serve to
characterize, in the sense explained above, the "degree of dependence"
between two variables. Many other measures have been proposed for
the same purpose. We shall here only mention the mean square contingency
introduced by K. Pearson. Consider two variables $\xi$, $\eta$ with a
distribution of the discrete type as defined by (21.1.1), and suppose
that the number of possible values is finite for both variables. The
probabilities $p_{ik}$ then form a matrix with, say, $m$ rows and $n$ columns.
Since any row or column consisting exclusively of zeros may be
discarded, we may suppose that every row and every column contains
at least one positive element, so that the row sums $p_{i.}$ and the column
sums $p_{.k}$ are all positive. The mean square contingency of the
distribution is then

(21.9.6)    $\varphi^2 = \sum_{i,k} \dfrac{(p_{ik} - p_{i.} p_{.k})^2}{p_{i.} p_{.k}} = \sum_{i,k} \dfrac{p_{ik}^2}{p_{i.} p_{.k}} - 1.$

By (21.1.4), $\varphi^2 = 0$ when and only when the variables are independent.
On the other hand, by means of the inequalities $p_{ik} \le p_{i.}$ and $p_{ik} \le p_{.k}$
it follows from the last expression that $\varphi^2 \le q - 1$, where $q = \operatorname{Min}(m, n)$
denotes the smaller of the numbers $m$ and $n$, or their common value
if both are equal. Further, the sign of equality holds in the last relation
if and only if one of the variables is a uniquely determined
function of the other. Thus $0 \le \dfrac{\varphi^2}{q - 1} \le 1$, and the quantity $\dfrac{\varphi^2}{q - 1}$
may be used as a measure, on a standardized scale, of the degree of
dependence between the variables.

In the particular case $m = n = 2$, we obtain after reduction

(21.9.7)    $\varphi^2 = \dfrac{(p_{11} p_{22} - p_{12} p_{21})^2}{p_{1.} \, p_{2.} \, p_{.1} \, p_{.2}}.$

Thus in this case $\varphi^2$ is the square of the correlation coefficient $\rho$ given
by (21.7.3). We have here $q = 2$, so that $\dfrac{\varphi^2}{q - 1}$ is identical with $\varphi^2$.
Further, $\varphi^2$ assumes its maximum value 1 only in the two cases
$p_{12} = p_{21} = 0$ or $p_{11} = p_{22} = 0$.
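
The following sketch evaluates the mean square contingency for a few hypothetical probability matrices: a general 2x2 table, a table with independent variables, and a 3x3 table in which one variable is a function of the other. The function name and all numbers are illustrative.

```python
def mean_square_contingency(P):
    """phi^2 of (21.9.6) for a matrix P of joint probabilities with positive margins."""
    rows = [sum(r) for r in P]
    cols = [sum(r[k] for r in P) for k in range(len(P[0]))]
    return sum((P[i][k] - rows[i] * cols[k]) ** 2 / (rows[i] * cols[k])
               for i in range(len(P)) for k in range(len(P[0])))

P = [[0.3, 0.2],
     [0.1, 0.4]]
phi2 = mean_square_contingency(P)

# Second form in (21.9.6) and the 2x2 reduction (21.9.7).
rows = [sum(r) for r in P]
cols = [P[0][k] + P[1][k] for k in range(2)]
phi2_alt = sum(P[i][k] ** 2 / (rows[i] * cols[k])
               for i in range(2) for k in range(2)) - 1
phi2_2x2 = (P[0][0] * P[1][1] - P[0][1] * P[1][0]) ** 2 / (
    rows[0] * rows[1] * cols[0] * cols[1])

independent = [[0.5 * 0.4, 0.5 * 0.6],
               [0.5 * 0.4, 0.5 * 0.6]]      # p_ik = p_i. * p_.k
functional = [[0.2, 0.0, 0.0],
              [0.0, 0.3, 0.0],
              [0.0, 0.0, 0.5]]              # eta is determined by xi, q = 3

print(phi2, phi2_alt, phi2_2x2)
print(mean_square_contingency(independent), mean_square_contingency(functional))
```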


21.10. The ellipse of concentration. Consider a one-dimensional
random variable $\xi$ with the mean $m$ and the s. d. $\sigma$. If $\xi'$ is another
variable which is uniformly distributed (cf. 19.1) over the interval
$(m - \sigma\sqrt{3}, \; m + \sigma\sqrt{3})$, it is easily seen that $\xi'$ has the same mean and
s. d. as $\xi$. Thus the interval $(m - \sigma\sqrt{3}, \; m + \sigma\sqrt{3})$ may be taken as a
geometrical representation of the concentration of the $\xi$-distribution
about its centre of gravity $m$ (cf. also 16.6).

We now propose to find an analogous geometrical representation
of the concentration of a given two-dimensional distribution about its
centre of gravity $(m_1, m_2)$. For this purpose, we want to find a curve
enclosing the point $(m_1, m_2)$ such that, if a mass unit is uniformly
distributed over the area bounded by the curve, this distribution will
have the same first and second order moments as the given distribution.
(By a "uniform distribution" we mean, of course, a distribution
with a constant fr. f.)

In this general form, the problem is obviously undetermined, and
we shall restrict ourselves to finding an ellipse having the required
property. In order to simplify the writing, we may suppose $m_1 = m_2 = 0$.
Let the second order central moments of the given distribution be
$\mu_{20}$, $\mu_{11}$ and $\mu_{02}$. We shall suppose that we have $\rho^2 < 1$, so that our
distribution does not belong to the extreme type that has its total
mass situated on a straight line.

Consider the non-negative quadratic form

    $q(\xi, \eta) = a_{11}\xi^2 + 2a_{12}\xi\eta + a_{22}\eta^2.$

By (11.12.3) the area enclosed by the ellipse $q = c^2$ is $\pi c^2/\sqrt{A}$, where
$A = a_{11}a_{22} - a_{12}^2$. If a mass unit is uniformly distributed over this
area, the first order moments of the distribution will evidently be
zero, while the second order moments are according to (11.12.4)

    $\dfrac{c^2}{4} \dfrac{a_{22}}{A}$,  $-\dfrac{c^2}{4} \dfrac{a_{12}}{A}$  and  $\dfrac{c^2}{4} \dfrac{a_{11}}{A}.$

It is required to determine $c$ and the $a_{ik}$ such that these moments
coincide with $\mu_{20}$, $\mu_{11}$ and $\mu_{02}$ respectively. It is readily seen that this
is effected by taking $c^2 = 4$, and

    $a_{11} = \dfrac{\mu_{02}}{M}$,  $a_{12} = -\dfrac{\mu_{11}}{M}$,  $a_{22} = \dfrac{\mu_{20}}{M},$

where $M = \mu_{20}\mu_{02} - \mu_{11}^2$. It will be seen that the form $q(\xi, \eta)$ thus
obtained is the reciprocal (cf. 11.7) of the form

    $Q(\xi, \eta) = \mu_{20}\xi^2 + 2\mu_{11}\xi\eta + \mu_{02}\eta^2.$

[Fig. 24. Q = centre of gravity. QA = orthogonal m. sq. regression line. QB = m. sq. regression line, $\eta$ on $\xi$. QC = m. sq. regression line, $\xi$ on $\eta$.]

Returning to the general case of an arbitrary centre of gravity
$(m_1, m_2)$, and replacing the $\mu_{ik}$ by their expressions in terms of $\sigma_1$, $\sigma_2$
and $\rho$, it thus follows that a uniform distribution of a mass unit over
the area enclosed by the ellipse

(21.10.1)    $\dfrac{1}{1 - \rho^2} \left( \dfrac{(\xi - m_1)^2}{\sigma_1^2} - \dfrac{2\rho(\xi - m_1)(\eta - m_2)}{\sigma_1\sigma_2} + \dfrac{(\eta - m_2)^2}{\sigma_2^2} \right) = 4$

has the same first and second order moments as the given distribution.
This ellipse will be called the ellipse of concentration corresponding
to the given distribution.

The domain enclosed by the ellipse (21.10.1) may thus be regarded
as a two-dimensional analogue of the interval $(m - \sigma\sqrt{3}, \; m + \sigma\sqrt{3})$.
When two distributions in $R_2$ with the same centre of gravity are
such that one of the concentration ellipses lies wholly within the
other, the former distribution will be said to have a greater concentration
than the latter. This concept will find an important use in the
theory of estimation (cf. 32.7).




If we replace the constant 4 in the equation (21.10.1) by an
arbitrary constant $c^2$, we obtain for various values of $c^2$ a family of
homothetic ellipses with the common centre $(m_1, m_2)$, which is identical
with the family of ellipses of inertia considered in 21.6. The common
major axis of the ellipses coincides with the orthogonal m. sq. regression
line of the distribution (cf. 21.6). The ordinary m. sq. regression
lines are diameters of the ellipses, each of which is conjugate to one
of the coordinate axes. The situation is illustrated by Fig. 24.
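
The defining property of the ellipse of concentration can be checked by simulation. The sketch below assumes illustrative values of $m_1$, $m_2$, $\sigma_1$, $\sigma_2$, $\rho$ and uses simple rejection sampling with a fixed seed; the sample moments agree with the prescribed ones only up to sampling error.

```python
import random

s1, s2, rho = 2.0, 1.0, 0.6     # illustrative standard deviations and correlation
m1, m2 = 1.0, -1.0              # illustrative centre of gravity

def q(x, y):
    """First member of (21.10.1)."""
    u, v = (x - m1) / s1, (y - m2) / s2
    return (u * u - 2 * rho * u * v + v * v) / (1 - rho * rho)

# Rejection sampling: draw uniformly from a bounding box of the ellipse q = 4
# and keep the points that fall inside it.
rng = random.Random(0)
pts = []
while len(pts) < 100_000:
    x = rng.uniform(m1 - 2 * s1, m1 + 2 * s1)
    y = rng.uniform(m2 - 2 * s2, m2 + 2 * s2)
    if q(x, y) <= 4:
        pts.append((x, y))

n = len(pts)
mx = sum(x for x, _ in pts) / n
my = sum(y for _, y in pts) / n
vxx = sum((x - mx) ** 2 for x, _ in pts) / n
vyy = sum((y - my) ** 2 for _, y in pts) / n
vxy = sum((x - mx) * (y - my) for x, y in pts) / n

print(mx, my)            # close to m1, m2
print(vxx, vxy, vyy)     # close to s1^2, rho*s1*s2, s2^2
```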

21.11. Addition of independent variables. Consider the two-dimensional
random variables $\mathbf{x}_1 = (\xi_1, \eta_1)$ and $\mathbf{x}_2 = (\xi_2, \eta_2)$. We define
the sum $\mathbf{x} = \mathbf{x}_1 + \mathbf{x}_2$ according to the rules of vector addition:

    $\mathbf{x} = (\xi, \eta) = (\xi_1 + \xi_2, \; \eta_1 + \eta_2).$

By 14.5, $\mathbf{x}$ is a two-dimensional random variable with a distribution
uniquely determined by the simultaneous distribution of $\mathbf{x}_1$ and $\mathbf{x}_2$.

Let us now suppose that $\mathbf{x}_1$ and $\mathbf{x}_2$ are independent variables according
to the definition of 14.4, and denote by $\varphi(t, u)$, $\varphi_1(t, u)$ and
$\varphi_2(t, u)$ the c. f:s of $\mathbf{x}$, $\mathbf{x}_1$ and $\mathbf{x}_2$ respectively. By the theorem (16.3.4)
on the mean value of a product of independent variables we then have

(21.11.1)    $\varphi(t, u) = E(e^{i(t\xi + u\eta)}) = E(e^{i(t\xi_1 + u\eta_1)} \, e^{i(t\xi_2 + u\eta_2)})$
             $\qquad\quad\;\; = E(e^{i(t\xi_1 + u\eta_1)}) \, E(e^{i(t\xi_2 + u\eta_2)}) = \varphi_1(t, u) \, \varphi_2(t, u).$

The generalization to an arbitrary number of terms is evident, and
we thus obtain the same theorem as for one-dimensional variables
(cf. 15.12): The c. f. of a sum of independent variables is the product of the
c. f:s of the terms.

We shall now consider the case of a sum $\mathbf{x} = \mathbf{x}_1 + \mathbf{x}_2 + \cdots + \mathbf{x}_n$,
where the $\mathbf{x}_\nu = (\xi_\nu, \eta_\nu)$ are independent variables all having the same
two-dimensional distribution. We shall suppose that this latter distribution
has finite moments of the second order $\mu_{20}$, $\mu_{11}$, $\mu_{02}$, and that
the first order moments are zero: $m_1 = m_2 = 0$. If $\varphi(t, u)$ is the c. f.
of this common distribution of the $\mathbf{x}_\nu$, we have by (21.3.3)

(21.11.2)    $\varphi(t, u) = 1 - \tfrac{1}{2}(\mu_{20} t^2 + 2\mu_{11} tu + \mu_{02} u^2) + o(t^2 + u^2).$

On the other hand, we have $\mathbf{x} = (\xi_1 + \cdots + \xi_n, \; \eta_1 + \cdots + \eta_n)$ and

    $\dfrac{\mathbf{x}}{\sqrt{n}} = \left( \dfrac{\xi_1 + \cdots + \xi_n}{\sqrt{n}}, \; \dfrac{\eta_1 + \cdots + \eta_n}{\sqrt{n}} \right).$

If $\varphi_n(t, u)$ is the c. f. of $\mathbf{x}/\sqrt{n}$, it thus follows from the above that
we have

    $\varphi_n(t, u) = \left[ \varphi\!\left( \dfrac{t}{\sqrt{n}}, \dfrac{u}{\sqrt{n}} \right) \right]^n.$

Substituting in (21.11.2) $t/\sqrt{n}$ and $u/\sqrt{n}$ for $t$ and $u$, we obtain

    $\varphi_n(t, u) = \left[ 1 - \dfrac{\mu_{20} t^2 + 2\mu_{11} tu + \mu_{02} u^2}{2n} + \dfrac{\delta(n, t, u)}{n} \right]^n,$

where, for any fixed $t$ and $u$, the quantity $\delta(n, t, u)$ tends to zero as
$n \to \infty$. Hence we obtain, in the same way as in the proof of the
Lindeberg-Lévy theorem in 17.4,

(21.11.3)    $\lim_{n \to \infty} \varphi_n(t, u) = e^{-\frac{1}{2}(\mu_{20} t^2 + 2\mu_{11} tu + \mu_{02} u^2)}.$

Thus $\varphi_n(t, u)$ tends for all $t$ and $u$ to a limit which is obviously continuous
for $(t, u) = (0, 0)$. By the continuity theorem for c. f:s proved
in 10.7, we may then assert that this limit is the c. f. of a certain
distribution which in its turn is the limit, for $n \to \infty$, of the distribution
of the variable $\mathbf{x}/\sqrt{n}$.

Thus if $\mathbf{x}_1, \mathbf{x}_2, \ldots$ are independent two-dimensional variables, all
having the same distribution with finite second order moments and first
order moments equal to zero, the distribution of the variable $\dfrac{\mathbf{x}_1 + \cdots + \mathbf{x}_n}{\sqrt{n}}$
always tends to a limiting distribution as $n \to \infty$, and the c. f. of the
limiting distribution is given by the second member of (21.11.3). Except
the trivial restriction $m_1 = m_2 = 0$, this is the two-dimensional
generalization of the Lindeberg-Lévy theorem of 17.4.
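
The convergence in (21.11.3) can be observed numerically for a common distribution whose c. f. is known in closed form. The sketch below assumes a hypothetical distribution with three equally likely mass points and zero first order moments; the point $(t, u)$ is arbitrary.

```python
import cmath
import math

# Hypothetical common distribution of the x_v: three equally likely mass points.
pts = [(1.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
p = [1 / 3] * 3

mean_x = sum(pi * x for pi, (x, y) in zip(p, pts))   # both first order moments
mean_y = sum(pi * y for pi, (x, y) in zip(p, pts))   # are zero for these points
mu20 = sum(pi * x * x for pi, (x, y) in zip(p, pts))
mu02 = sum(pi * y * y for pi, (x, y) in zip(p, pts))
mu11 = sum(pi * x * y for pi, (x, y) in zip(p, pts))

def cf(t, u):
    """c. f. of the common distribution: E exp(i(t*xi + u*eta))."""
    return sum(pi * cmath.exp(1j * (t * x + u * y)) for pi, (x, y) in zip(p, pts))

def cf_scaled_sum(t, u, n):
    """c. f. of (x_1 + ... + x_n)/sqrt(n), by the product rule for c. f:s."""
    r = math.sqrt(n)
    return cf(t / r, u / r) ** n

t, u = 0.7, -1.2
limit = math.exp(-0.5 * (mu20 * t * t + 2 * mu11 * t * u + mu02 * u * u))
for n in (1, 10, 100, 10_000):
    print(n, abs(cf_scaled_sum(t, u, n) - limit))
```

The printed distances shrink as $n$ grows, in agreement with the limit relation.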

It should be observed that, with respect to the second order moments,
we have here only assumed that these are finite. Now, given
any quantities $\mu_{20}$, $\mu_{11}$ and $\mu_{02}$ such that the quadratic form

    $\mu_{20} t^2 + 2\mu_{11} tu + \mu_{02} u^2$

is non-negative, it is possible (cf. 21.2) to find a distribution with
$m_1 = m_2 = 0$ and the given quantities for their second order moments.
Taking this distribution as the common distribution of the $\mathbf{x}_\nu$ in the
above theorem, it follows that the expression in the second member
of (21.11.3) is always the c. f. of a certain distribution, as soon as
the quadratic form within the brackets is non-negative. If $\mathbf{x}$ is a
variable having this c. f., and if $\mathbf{m} = (m_1, m_2)$ is a constant vector,
the variable $\mathbf{m} + \mathbf{x}$ has the c. f.

(21.11.4)    $e^{i(m_1 t + m_2 u) - \frac{1}{2}(\mu_{20} t^2 + 2\mu_{11} tu + \mu_{02} u^2)}.$

The distribution corresponding to this c. f. is the two-dimensional normal
distribution, which will be further discussed in the following paragraph.


21.12. The normal distribution. We now proceed to study the
distribution corresponding to the c. f. (21.11.4). We shall have to
distinguish two cases according as the non-negative quadratic form

    $Q(t, u) = \mu_{20} t^2 + 2\mu_{11} tu + \mu_{02} u^2$

is definite or semi-definite positive (cf. 11.10). In the former case, we
shall say that we are concerned with a non-singular normal distribution,
whereas in the latter case we have a singular normal distribution.
When we use the expression normal distribution without specification,
it will always be understood that we include both kinds of distributions.

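We shall first consider the case of a definite positive form $Q(t, u)$.
Then the reciprocal form $Q^{-1}(x, y)$ exists and has the expression (cf.
21.10)

    $Q^{-1}(x, y) = \dfrac{\mu_{02} x^2 - 2\mu_{11} xy + \mu_{20} y^2}{M} = \dfrac{1}{1 - \rho^2} \left( \dfrac{x^2}{\sigma_1^2} - \dfrac{2\rho xy}{\sigma_1\sigma_2} + \dfrac{y^2}{\sigma_2^2} \right),$

where $M = \mu_{20}\mu_{02} - \mu_{11}^2 = \sigma_1^2 \sigma_2^2 (1 - \rho^2)$. From (11.12.1 b) we now obtain

    $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{i(tx + uy) - \frac{1}{2} Q^{-1}(x, y)} \, dx \, dy = 2\pi\sqrt{M} \, e^{-\frac{1}{2} Q(t, u)},$

or, substituting $x - m_1$ for $x$ and $y - m_2$ for $y$,

    $\dfrac{1}{2\pi\sqrt{M}} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{i(tx + uy) - \frac{1}{2} Q^{-1}(x - m_1, \, y - m_2)} \, dx \, dy = e^{i(m_1 t + m_2 u) - \frac{1}{2} Q(t, u)}.$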


The last relation shows that the function

(21.12.1)    $f(x, y) = \dfrac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}} \, e^{-\frac{1}{2} Q^{-1}(x - m_1, \, y - m_2)}$

is a two-dimensional fr. f. with the c. f.

(21.12.2)    $\varphi(t, u) = e^{i(m_1 t + m_2 u) - \frac{1}{2} Q(t, u)}.$

The development (21.3.2) for the c. f. shows that the quantities $m_i$
and $\mu_{ik}$ have, for this distribution, their usual signification as mean
values and second order central moments. The function $f(x, y)$ defined
by (21.12.1) is the normal fr. f. in two variables. It has a maximum
point at the centre of gravity $(m_1, m_2)$. The homothetic ellipses

(21.12.3)    $\dfrac{1}{2(1 - \rho^2)} \left( \dfrac{(x - m_1)^2}{\sigma_1^2} - \dfrac{2\rho(x - m_1)(y - m_2)}{\sigma_1\sigma_2} + \dfrac{(y - m_2)^2}{\sigma_2^2} \right) = c^2$

that have already appeared in 21.6 and 21.10 in connection with the
ellipses of inertia and of concentration of an arbitrary distribution,
play in the case of a normal distribution the further role of equiprobability
curves. For any point belonging to (21.12.3) we have, in
fact, $f(x, y) = \dfrac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}} \, e^{-c^2}$. Since by (11.12.3) the area of the
ring between the ellipses corresponding to $c$ and $c + dc$ is

    $4\pi\sigma_1\sigma_2\sqrt{1 - \rho^2} \; c \, dc,$

the mass situated in this ring is $2c \, e^{-c^2} dc$, and thus the mass in the
whole plane outside the ellipse (21.12.3) is (cf. Ex. 15, p. 319)

    $\int_c^{\infty} 2c \, e^{-c^2} dc = e^{-c^2}.$

The form of the equiprobability ellipses (21.12.3) gives a good idea
of the shape of the normal frequency surface $z = f(x, y)$. For $\rho = 0$,
$\sigma_1 = \sigma_2$, the ellipses are circles. As $\rho$ approaches $+1$ or $-1$, the
ellipses become thin and needle-shaped, thus showing the tendency of
the mass to accumulate towards the common major axis of the ellipses,
which is the orthogonal m. sq. regression line (cf. 21.6) of the
distribution.

A variable $(\xi, \eta)$ with the fr. f. (21.12.1) is said to possess a non-singular
normal distribution. The c. f. of the marginal distribution of
$\xi$ is then by (21.3.4)

    $\varphi(t, 0) = e^{i m_1 t - \frac{1}{2}\sigma_1^2 t^2}.$

Thus by 17.2 $\xi$ is normal $(m_1, \sigma_1)$, with the marginal fr. f.

    $f_1(x) = \dfrac{1}{\sigma_1\sqrt{2\pi}} \, e^{-\frac{(x - m_1)^2}{2\sigma_1^2}}.$

By index permutation we obtain the corresponding expression for the
marginal fr. f. $f_2(y)$ of $\eta$.

In the particular case when $\rho = 0$, it is seen that we have
$f(x, y) = f_1(x) f_2(y)$, which implies that the variables are independent.
For the normal distribution, it is thus legitimate to assert that two
non-correlated variables are independent, though we have seen in 21.7
that for a general distribution this may be untrue.

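The conditional fr. f. of $\eta$, relative to the hypothesis $\xi = x$, is by
(21.4.6)

(21.12.4)    $f(y \mid x) = \dfrac{f(x, y)}{f_1(x)} = \dfrac{1}{\sigma_2\sqrt{2\pi(1 - \rho^2)}} \, \exp\left[ -\dfrac{1}{2\sigma_2^2(1 - \rho^2)} \left( y - m_2 - \dfrac{\rho\sigma_2}{\sigma_1}(x - m_1) \right)^2 \right].$

This is a normal fr. f. in $y$, with the mean

    $m_2(x) = m_2 + \dfrac{\rho\sigma_2}{\sigma_1}(x - m_1)$

and the s. d. $\sigma_2\sqrt{1 - \rho^2}$. Thus the regression of $\eta$ on $\xi$ is linear, and
the conditional variance of $\eta$ is independent of the value assumed
by $\xi$. The analogous properties of the conditional distribution of $\xi$
for a given value of $\eta$ are deduced in the same way.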
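
The conditional fr. f. can be checked pointwise by dividing the joint fr. f. (21.12.1) by the marginal fr. f. of $\xi$. The parameter values and the point `x0` in the sketch below are illustrative assumptions.

```python
import math

m1, m2, s1, s2, rho = 1.0, -2.0, 1.5, 0.8, -0.4   # illustrative parameters

def f(x, y):
    """Normal fr. f. in two variables, as in (21.12.1)."""
    u, v = (x - m1) / s1, (y - m2) / s2
    q = (u * u - 2 * rho * u * v + v * v) / (1 - rho * rho)
    return math.exp(-0.5 * q) / (2 * math.pi * s1 * s2 * math.sqrt(1 - rho * rho))

def normal_pdf(z, mean, sd):
    """One-dimensional normal fr. f. with the given mean and s. d."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

x0 = 2.3                                        # the given value of xi
cond_mean = m2 + rho * s2 / s1 * (x0 - m1)      # regression of eta on xi is linear
cond_sd = s2 * math.sqrt(1 - rho * rho)         # does not depend on x0

for y in (-4.0, -3.0, -2.0, -1.0, 0.0):
    lhs = f(x0, y) / normal_pdf(x0, m1, s1)     # f(y | x) = f(x, y) / f_1(x)
    rhs = normal_pdf(y, cond_mean, cond_sd)
    print(y, lhs, rhs)
```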

When the non-negative form $Q(t, u)$ is semi-definite, the determinant
$M$ is zero, and no reciprocal form exists (cf. 11.7 and 11.10). It follows,
however, from the preceding paragraph that the expression
(21.12.2) is still the c. f. of a certain distribution, and this will be
called a singular normal distribution. By 21.2, the total mass of this
distribution is situated in a single point or on a straight line, according
as the rank of the moment matrix $M$ is 0 or 1.

In such a case, it is evident that no finite two-dimensional fr. f.
exists. Still, a singular normal distribution may always be regarded
as the limit of a sequence of non-singular normal distributions. In
order to see this, we may consider the sequence of non-singular normal
distributions corresponding to the given values of $m_1$ and $m_2$, and
the sequence of definite positive forms $Q_\nu(t, u) = Q(t, u) + \varepsilon_\nu^2 (t^2 + u^2)$,
where $\varepsilon_\nu \to 0$. The corresponding c. f:s tend, of course, to the limit
(21.12.2), and by the continuity theorem 10.7 the non-singular distributions
then tend to the given singular distribution.



Consider a singular normal distribution with a moment matrix $M$
of rank 1. By 21.8, the corresponding variables $\xi$ and $\eta$ may, with a
probability equal to 1, be represented as linear functions of a single
variable $X$. Conversely, $X$ is a linear function of $\xi$ and $\eta$, and the
c. f. of $X$ is then of the form $e^{-\frac{1}{2}\sigma^2 t^2}$, so that $X$ is normally distributed.
The case when $M$ is of rank 0 may be regarded as the limiting
case $\sigma = 0$, and we thus have the following result:

A two-dimensional singular normal distribution may be regarded as
an ordinary one-dimensional normal distribution on a certain straight line
in the plane.


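When $m_1 = m_2 = 0$, we obtain from (12.6.8) the following expansion of the normal
fr. f. in powers of $\rho$:

(21.12.5)    $f(x, y) = \dfrac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}} \, e^{-\frac{1}{2(1 - \rho^2)} \left( \frac{x^2}{\sigma_1^2} - \frac{2\rho xy}{\sigma_1\sigma_2} + \frac{y^2}{\sigma_2^2} \right)} = \dfrac{1}{\sigma_1\sigma_2} \sum_{\nu=0}^{\infty} \dfrac{\Phi^{(\nu+1)}(x/\sigma_1) \, \Phi^{(\nu+1)}(y/\sigma_2)}{\nu!} \, \rho^\nu.$

The series may be integrated term by term, and we deduce a corresponding expression
for the normal d. f.

(21.12.6)    $\int_{-\infty}^{x} \int_{-\infty}^{y} f(u, v) \, du \, dv = \sum_{\nu=0}^{\infty} \dfrac{\Phi^{(\nu)}(x/\sigma_1) \, \Phi^{(\nu)}(y/\sigma_2)}{\nu!} \, \rho^\nu.$

For $x = y = 0$ we obtain from (21.12.5)

    $\sum_{\nu=0}^{\infty} \dfrac{[\Phi^{(\nu+1)}(0)]^2}{\nu!} \, \rho^\nu = \dfrac{1}{2\pi\sqrt{1 - \rho^2}},$

and hence by integration with respect to $\rho$

    $\sum_{\nu=1}^{\infty} \dfrac{[\Phi^{(\nu)}(0)]^2}{\nu!} \, \rho^\nu = \dfrac{1}{2\pi} \arcsin \rho.$

Now (21.12.6) gives

    $\int_{-\infty}^{0} \int_{-\infty}^{0} f(u, v) \, du \, dv = \dfrac{1}{4} + \dfrac{1}{2\pi} \arcsin \rho.$

By the symmetry properties of the fr. f. $f(x, y)$, it then follows that in each of the
first and third quadrants of the $(x, y)$-plane we have the mass $\dfrac{1}{4} + \dfrac{1}{2\pi} \arcsin \rho$,
while each of the second and fourth quadrants contains the mass $\dfrac{1}{4} - \dfrac{1}{2\pi} \arcsin \rho$.
These relations are due to Stieltjes, Ref. 220, and Sheppard, Ref. 211.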
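
The quadrant formula is easy to confirm by numerical integration. The sketch below assumes standardized variables ($m_1 = m_2 = 0$, $\sigma_1 = \sigma_2 = 1$), takes $\rho = 0.5$ as an illustrative value, and applies a plain midpoint rule over a truncated first quadrant.

```python
import math

rho = 0.5    # illustrative correlation coefficient

def f(x, y):
    """Standardized normal fr. f. in two variables."""
    q = (x * x - 2 * rho * x * y + y * y) / (1 - rho * rho)
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(1 - rho * rho))

# Midpoint rule over the square [0, 8] x [0, 8]; the mass of the first
# quadrant lying outside this square is negligible.
h = 0.02
mids = [(k + 0.5) * h for k in range(400)]
mass = sum(f(x, y) for x in mids for y in mids) * h * h

formula = 0.25 + math.asin(rho) / (2 * math.pi)
print(mass, formula)
```

For $\rho = 0.5$ we have $\arcsin \rho = \pi/6$, so the formula gives exactly $1/3$.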



CHAPTER 22. 


General Properties of Distributions in $R_n$.

22.1. Two simple types of distributions. Conditional distributions.
The joint probability distribution (cf. 14.2) of $n$ one-dimensional
random variables $\xi_1, \ldots, \xi_n$ is a distribution in the $n$-dimensional space
$R_n$, with the variable point $\mathbf{x} = (\xi_1, \ldots, \xi_n)$.

The probability function (cf. 8.4) of the distribution is a set function
$P(S) = P(\mathbf{x} \subset S)$, which for any set $S$ in $R_n$ represents the probability
of the relation $\mathbf{x} \subset S$. The distribution function, on the other
hand, is a function of $n$ real variables defined by the relation (8.3.1):

    $F(x_1, \ldots, x_n) = P(\xi_1 \le x_1, \ldots, \xi_n \le x_n).$

The distribution is uniquely defined by either function $P$ or $F$.

As before, we shall make frequent use of our mechanical illustration,
interpreting the probability distribution by means of a distribution
of a unit of mass over $R_n$. If we pick out a group of k variables
$\xi_{i_1}, \ldots, \xi_{i_k}$, and project the mass in the original n-dimensional
distribution on the k-dimensional subspace of these variables, we obtain
(cf 8.4) the k-dimensional marginal distribution of $\xi_{i_1}, \ldots, \xi_{i_k}$. The
corresponding marginal d. f. is obtained, as in the two-dimensional
case, by putting the n - k remaining variables in F equal to $+\infty$.
Thus in particular the marginal d. f. of the single variable $\xi_1$ is
$F_1(x) = F(x, +\infty, \ldots, +\infty)$, and similarly for any $\xi_i$.

As in the cases n = 1 and n = 2 (cf 15.2 and 21.1), we now introduce
the two simple types of distributions - the discrete and the continuous
type. The definitions and properties of these are directly
analogous to those given in 21.1, and we shall here only add some
brief comments.

For a distribution of the discrete type, we have on the axis of each
$\xi_i$ a finite or enumerable set of points $x_{i1}, x_{i2}, \ldots$ which are the
discrete mass points of the marginal distribution of $\xi_i$. The total
mass of the n-dimensional distribution of $x = (\xi_1, \ldots, \xi_n)$ is then
concentrated in the discrete points $(x_{1\nu_1}, \ldots, x_{n\nu_n})$, each of these points
carrying a mass $p_{\nu_1 \ldots \nu_n} \ge 0$, so that

    $P(\xi_1 = x_{1\nu_1}, \ldots, \xi_n = x_{n\nu_n}) = p_{\nu_1 \ldots \nu_n}.$




The marginal distribution of any group of k variables is also of the
discrete type, and the corresponding p:s are obtained in a similar
way as in (21.1.2) and (21.1.3), by summing the $p_{\nu_1 \ldots \nu_n}$ over all
values of the subscripts belonging to the n - k remaining variables.
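
The summation over the remaining variables can be sketched in a few lines of Python. The joint distribution below is a hypothetical three-dimensional example chosen only for illustration, and the helper name `marginal` is likewise an assumption; exact fractions are used so that the masses can be compared without rounding.

```python
from collections import defaultdict
from fractions import Fraction


def marginal(joint, keep):
    """Sum the point masses over all coordinates not listed in `keep`.

    `joint` maps a mass point (a tuple of coordinates) to its mass;
    `keep` lists the 0-based positions of the variables that are retained.
    """
    out = defaultdict(Fraction)
    for point, mass in joint.items():
        out[tuple(point[i] for i in keep)] += mass
    return dict(out)


# Hypothetical discrete distribution of (xi_1, xi_2, xi_3); illustration only.
joint = {
    (0, 0, 0): Fraction(1, 8),
    (0, 0, 1): Fraction(1, 8),
    (0, 1, 0): Fraction(1, 4),
    (1, 0, 1): Fraction(1, 4),
    (1, 1, 1): Fraction(1, 4),
}

# Two-dimensional marginal distribution of (xi_1, xi_3).
m13 = marginal(joint, keep=(0, 2))
```

Here the mass of the marginal point (0, 0) is the sum of the masses carried by (0, 0, 0) and (0, 1, 0).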

For a distribution of the continuous type, the d. f. F is everywhere
continuous, and the probability density or frequency function (cf 8.4)

    $f(x_1, \ldots, x_n) = \frac{\partial^n F}{\partial x_1 \cdots \partial x_n}$

exists and is continuous everywhere, except possibly in certain points
belonging to a finite number of hypersurfaces in $R_n$. The differential
$f(x_1, \ldots, x_n)\,dx_1 \ldots dx_n$ will be called the probability element (cf 15.1)
of the distribution. The fr. f. of the marginal distribution of any
group of k variables is obtained by integrating $f(x_1, \ldots, x_n)$ with
respect to the n - k remaining variables, as shown for the
two-dimensional case by (21.1.5) and (21.1.6).

When $\xi_1, \ldots, \xi_n$ have a distribution of the continuous type, the
conditional fr. f. of $\xi_1, \ldots, \xi_k$, relative to the hypothesis
$\xi_{k+1} = x_{k+1}, \ldots, \xi_n = x_n$, is given by the expression generalizing (21.4.10):

(22.1.1)  $f(x_1, \ldots, x_k \mid x_{k+1}, \ldots, x_n) = \frac{f(x_1, \ldots, x_n)}{\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(\xi_1, \ldots, \xi_k, x_{k+1}, \ldots, x_n)\,d\xi_1 \ldots d\xi_k}.$


Finally, let us consider two variables $x = (\xi_1, \ldots, \xi_m)$ and
$y = (\eta_1, \ldots, \eta_n)$ such that the (m + n)-dimensional combined variable (x, y)
has a distribution of the continuous type. In generalization of (21.1.7)
we then find that a necessary and sufficient condition for the
independence of x and y is

(22.1.2)  $f(x_1, \ldots, x_m, y_1, \ldots, y_n) = f_1(x_1, \ldots, x_m)\,f_2(y_1, \ldots, y_n),$

where $f$, $f_1$ and $f_2$ are the fr. f:s of (x, y), x and y respectively. The
generalization to any number of variables x, y, ... is immediate.


22.2. Change of variables in a continuous distribution. - Let
$x = (\xi_1, \ldots, \xi_n)$ be a random variable in $R_n$, and consider the m functions

(22.2.1)  $\eta_i = g_i(\xi_1, \ldots, \xi_n), \quad (i = 1, 2, \ldots, m),$



where m is not necessarily equal to n. According to 14.5, the vector
$y = (\eta_1, \ldots, \eta_m)$ then constitutes a random variable in a space $R_m$ of
m dimensions, with a probability distribution uniquely determined by
the distribution of x.

We shall here only consider the particular case when m = n and
the x-distribution belongs to the continuous type. If the functions
$g_i$ satisfy certain conditions, the y-distribution may then be explicitly
determined, as we are now going to show.

Let us assume that the following conditions A) and B) are satisfied
for all x such that the fr. f. $f(x_1, \ldots, x_n)$ is different from zero.


A) The functions $g_i$ are everywhere unique and continuous, and
have continuous partial derivatives $\frac{\partial g_i}{\partial \xi_k}$ in all points x, except possibly
in certain points belonging to a finite number of hypersurfaces.

B) The relations (22.2.1), where we now take m = n, define a
one-to-one correspondence between the points $x = (\xi_1, \ldots, \xi_n)$ and
$y = (\eta_1, \ldots, \eta_n)$, so that we have conversely $\xi_i = h_i(\eta_1, \ldots, \eta_n)$ for
$i = 1, \ldots, n$, where the $h_i$ are unique.

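Consider a point x which does not belong to any of the exceptional
hypersurfaces, and is such that the Jacobian
$\frac{\partial(\eta_1, \ldots, \eta_n)}{\partial(\xi_1, \ldots, \xi_n)}$
is different from zero. The Jacobian of the inverse transformation,

    $J = \frac{\partial(\xi_1, \ldots, \xi_n)}{\partial(\eta_1, \ldots, \eta_n)} = \left|\frac{\partial h_i}{\partial \eta_k}\right|,$

is then finite in the point y corresponding to x, since we have

    $J \cdot \frac{\partial(\eta_1, \ldots, \eta_n)}{\partial(\xi_1, \ldots, \xi_n)} = 1.$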


When S is a sufficiently small neighbourhood of x, and T is the
corresponding set in the y-space, J is finite for all points of T, and
we have

(22.2.2)  $P(S) = \int_S f(x_1, \ldots, x_n)\,dx_1 \ldots dx_n = \int_T f(x_1, \ldots, x_n)\,|J|\,dy_1 \ldots dy_n,$

where in the last integral the $x_i$ should be replaced by their
expressions $x_i = h_i(y_1, \ldots, y_n)$ in terms of the $y_i$.

The probability element of the x-distribution is thus transformed
according to the relation

(22.2.3)  $f(x_1, \ldots, x_n)\,dx_1 \ldots dx_n = f(x_1, \ldots, x_n)\,|J|\,dy_1 \ldots dy_n,$



where in the second member $x_i = h_i(y_1, \ldots, y_n)$. The fr. f. of the new
variable $y = (\eta_1, \ldots, \eta_n)$ is thus $f(x_1, \ldots, x_n)\,|J|$.

When n = 1, and the transformation $\eta = g(\xi)$ or $\xi = h(\eta)$ is unique
in both senses, (22.2.3) reduces to

    $f(x)\,dx = f[h(y)]\,|h'(y)|\,dy,$

where the coefficient of dy is the fr. f. of the variable $\eta$. An example
of this relation is given by the expression (15.1.2), which is related
to the linear transformation $\eta = a\xi + b$, or $\xi = \frac{\eta - b}{a}$.


Suppose now that the condition B is not satisfied. To each point x, there still
corresponds one and only one point y, but the converse transformation is not unique:
to a given y there may correspond more than one x. We then have to divide the
x-space in several parts, so that in each part the correspondence is unique in both
senses. The mass carried by a set T in the y-space will then be equal to the sum
of the contributions arising from the corresponding sets in the various parts of the
x-space. Each contribution is represented by a multiple integral that may be
transformed according to (22.2.2), and it thus follows that the fr. f. of y now assumes the
form $\sum_\nu f_\nu |J_\nu|$, where the sum is extended over the various points $x_\nu$ corresponding
to a given y, and $f_\nu$ and $J_\nu$ are the corresponding values of $f(x_1, \ldots, x_n)$ and J.

In the case n = 1, an example of this type is afforded by the transformation
$\eta = \xi^2$ considered in 15.1. The expression (15.1.4) for the fr. f. is evidently a special
case of the general expression $\sum_\nu f_\nu |J_\nu|$. - A more complicated example will occur
in 29.3.
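
As an illustration of the sum $\sum_\nu f_\nu |J_\nu|$, the following Python sketch treats the transformation $\eta = \xi^2$ under the additional assumption, made only for this example, that $\xi$ is normal with mean 0 and s. d. 1. The two branches $x = \pm\sqrt{y}$ each contribute $f(x)\,|dx/dy|$, and the sum is compared with the known closed form $e^{-y/2}/\sqrt{2\pi y}$. All function names are hypothetical.

```python
import math


def normal_density(x):
    """fr. f. of a normal variable with mean 0 and s. d. 1."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def density_of_square(y):
    """fr. f. of eta = xi^2 as the sum over the two branches x = +sqrt(y), -sqrt(y)."""
    if y <= 0:
        return 0.0
    root = math.sqrt(y)
    jacobian = 1.0 / (2.0 * root)  # |dx/dy| is the same on both branches
    return normal_density(root) * jacobian + normal_density(-root) * jacobian


def closed_form(y):
    """Closed form exp(-y/2) / sqrt(2*pi*y) for y > 0."""
    return math.exp(-0.5 * y) / math.sqrt(2.0 * math.pi * y)
```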


22.3. Mean values, moments. - The mean value of a function
$g(\xi_1, \ldots, \xi_n)$ integrable over $R_n$ with respect to the n-dimensional pr. f.
P(S) has been defined in (15.3.2) by the integral

    $E\,g(\xi_1, \ldots, \xi_n) = \int_{R_n} g(x_1, \ldots, x_n)\,dP.$

The moments of the distribution (cf 9.2 and 21.2) are the mean
values

(22.3.1)  $\alpha_{\nu_1 \ldots \nu_n} = E(\xi_1^{\nu_1} \cdots \xi_n^{\nu_n}) = \int_{R_n} x_1^{\nu_1} \cdots x_n^{\nu_n}\,dP,$

where $\nu_1 + \cdots + \nu_n$ is the order of the moment. For the first order
moments we shall use the notation

    $m_i = E(\xi_i) = \int_{R_n} x_i\,dP.$




The point $m = (m_1, \ldots, m_n)$ is the centre of gravity of the mass in the
n-dimensional distribution.

The central moments $\mu_{\nu_1 \ldots \nu_n}$, or the moments about the point m,
are obtained by replacing in (22.3.1) each power $\xi_i^{\nu_i}$ by $(\xi_i - m_i)^{\nu_i}$.
The second order central moments play an important part in the sequel,
and whenever nothing is explicitly said to the contrary, we shall
always assume that these are finite. The use of the $\mu$ notation for
these moments would be somewhat awkward when n > 2, owing to
the large number of subscripts required. In order to simplify the
writing, we shall find it convenient to introduce a particular notation,
putting


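(22.3.2)  $\lambda_{ii} = \sigma_i^2 = E(\xi_i - m_i)^2,$
          $\lambda_{ik} = \rho_{ik}\,\sigma_i\sigma_k = E((\xi_i - m_i)(\xi_k - m_k)).$

Thus $\lambda_{ii}$ denotes the variance and $\sigma_i$ the s. d. of the variable $\xi_i$, while
$\lambda_{ik}$ denotes the covariance of $\xi_i$ and $\xi_k$. The correlation coefficient
$\rho_{ik} = \frac{\lambda_{ik}}{\sigma_i\sigma_k}$ is, of course, defined only when $\sigma_i$ and $\sigma_k$ are both positive.
Obviously we have $\lambda_{ki} = \lambda_{ik}$, $\rho_{ki} = \rho_{ik}$ and $\rho_{ii} = 1$. - In the
particular case n = 2, we have $\lambda_{11} = \mu_{20}$, $\lambda_{12} = \mu_{11}$, $\lambda_{22} = \mu_{02}$.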

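In generalization of (21.2.5), we find that the mean value

(22.3.3)  $E\left(\sum_{i=1}^{n} t_i(\xi_i - m_i)\right)^2 = \sum_{i,k=1}^{n} \lambda_{ik}\,t_i t_k$

is never negative, so that the second member is a non-negative
quadratic form in $t_1, \ldots, t_n$. The matrix of this form is the moment
matrix

    $\mathbf{\Lambda} = \begin{pmatrix} \lambda_{11} & \cdots & \lambda_{1n} \\ \vdots & & \vdots \\ \lambda_{n1} & \cdots & \lambda_{nn} \end{pmatrix},$

while the form obtained by the substitution $t_i = \frac{u_i}{\sigma_i}$ corresponds to the
correlation matrix

    $\mathbf{P} = \begin{pmatrix} \rho_{11} & \cdots & \rho_{1n} \\ \vdots & & \vdots \\ \rho_{n1} & \cdots & \rho_{nn} \end{pmatrix},$

which is defined as soon as all the $\sigma_i$ are positive.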



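Thus the symmetric matrices $\mathbf{\Lambda}$ and $\mathbf{P}$ are both non-negative (cf
11.10). Between $\mathbf{\Lambda}$ and $\mathbf{P}$, we have the relation

    $\mathbf{\Lambda} = \mathbf{\Sigma}\,\mathbf{P}\,\mathbf{\Sigma},$

where $\mathbf{\Sigma}$ denotes the diagonal matrix formed with $\sigma_1, \ldots, \sigma_n$ as its
diagonal elements. By 11.6, it then follows that $\mathbf{\Lambda}$ and $\mathbf{P}$ have the
same rank. For the corresponding determinants $\Lambda = |\lambda_{ik}|$ and $P = |\rho_{ik}|$,
we have $\Lambda = \sigma_1^2 \cdots \sigma_n^2\,P$. From (11.10.3) we obtain

(22.3.4)  $0 \le \Lambda \le \lambda_{11} \cdots \lambda_{nn}, \quad 0 \le P \le \rho_{11} \cdots \rho_{nn} = 1.$

In the particular case when $\lambda_{ik} = 0$ for $i \ne k$, we shall say that
the variables $\xi_1, \ldots, \xi_n$ are uncorrelated. The moment matrix $\mathbf{\Lambda}$ is
then a diagonal matrix, and $\Lambda = \lambda_{11} \cdots \lambda_{nn}$. If, in addition, all the
$\sigma_i$ are positive, the correlation matrix $\mathbf{P}$ exists and is identical with
the unit matrix $\mathbf{I}$, so that P = 1. Moreover, it is only in the
uncorrelated case that we have $\Lambda = \lambda_{11} \cdots \lambda_{nn}$ and P = 1.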
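
The relations between the moment matrix and the correlation matrix can be verified on a small numerical case. In the Python sketch below, the standard deviations and correlation coefficients are hypothetical values chosen for illustration, and `det` is a helper written for this example; exact fractions are used so that the identity $\Lambda = \sigma_1^2 \cdots \sigma_n^2\,P$ and the inequalities (22.3.4) can be checked without rounding.

```python
from fractions import Fraction


def det(a):
    """Determinant by Laplace expansion along the first row (small n only)."""
    if len(a) == 1:
        return a[0][0]
    return sum(
        (-1) ** j * a[0][j] * det([row[:j] + row[j + 1:] for row in a[1:]])
        for j in range(len(a))
    )


# Hypothetical standard deviations and correlation matrix (illustration only).
sigma = [Fraction(2), Fraction(1), Fraction(3)]
rho = [
    [Fraction(1), Fraction(1, 2), Fraction(1, 4)],
    [Fraction(1, 2), Fraction(1), Fraction(1, 3)],
    [Fraction(1, 4), Fraction(1, 3), Fraction(1)],
]

n = len(sigma)
# lambda_ik = rho_ik * sigma_i * sigma_k, i.e. the elements of Sigma P Sigma.
lam = [[rho[i][k] * sigma[i] * sigma[k] for k in range(n)] for i in range(n)]

det_lam = det(lam)
det_rho = det(rho)
variance_product = sigma[0] ** 2 * sigma[1] ** 2 * sigma[2] ** 2
```

In this example P = 95/144, so that the determinant $\Lambda$ falls short of the product $\lambda_{11}\lambda_{22}\lambda_{33} = 36$ by exactly this factor.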


22.4. Characteristic functions. - The c. f. of the n-dimensional
random variable $x = (\xi_1, \ldots, \xi_n)$ is a function of the vector
$t = (t_1, \ldots, t_n)$, defined by the mean value (cf 10.6)

    $\varphi(t) = E(e^{i t'x}) = \int_{R_n} e^{i t'x}\,dP,$

where, in accordance with (11.2.1), $t'x = t_1\xi_1 + \cdots + t_n\xi_n$. The
properties of the c. f. of a two-dimensional variable (cf 21.3) directly
extend themselves to the case of a general n. In particular we have
in the neighbourhood of t = 0 a development generalizing (21.3.2)

(22.4.1)  $\varphi(t) = e^{i\sum_j m_j t_j}\left\{1 - \frac{1}{2}\sum_{j,k}\lambda_{jk}\,t_j t_k + o\left(\sum_j t_j^2\right)\right\}.$

If m = 0, this reduces to

(22.4.2)  $\varphi(t) = 1 - \frac{1}{2}\sum_{j,k}\lambda_{jk}\,t_j t_k + o\left(\sum_j t_j^2\right).$

The semi-invariants of a distribution in n dimensions are defined by
means of the expansion of $\log\varphi$ in the same way as in 15.10 for
the case n = 1.

As in 21.3, it is shown that a necessary and sufficient condition
for the independence of the variables x and y is that their joint c. f.
is of the form $\varphi(t, u) = \varphi_1(t)\,\varphi_2(u)$.



The c. f. of the marginal distribution of any group of k variables
picked out from $\xi_1, \ldots, \xi_n$ is obtained from $\varphi(t)$ by putting $t_i = 0$ for
all the n - k remaining variables. Thus the joint c. f. of $\xi_1, \ldots, \xi_k$ is

(22.4.3)  $E(e^{i(t_1\xi_1 + \cdots + t_k\xi_k)}) = \varphi(t_1, \ldots, t_k, 0, \ldots, 0).$

22.5. Rank of a distribution. - The rank of a distribution in $R_n$
(Frisch, Ref. 113; cf also Lukomski, Ref. 151) will be defined as the
common rank r of the moment matrix $\mathbf{\Lambda}$ and the correlation matrix
$\mathbf{P}$ introduced in 22.3. The distribution will be called singular or
non-singular, according as r < n or r = n.

In the particular case n = 2, $\mathbf{\Lambda}$ is identical with the matrix M
considered in 21.2. It was there shown that the rank of M is directly
connected with certain linear degeneration properties of the
distribution. We shall now prove that a similar connection exists in the
case of a general n.

A distribution in $R_n$ is non-singular when and only when there is no
hyperplane in $R_n$ that contains the total mass of the distribution.

In order that a distribution in $R_n$ should be of rank r, where r < n,
it is necessary and sufficient that the total mass of the distribution
should belong to a linear set $L_r$ of r dimensions, but not to any linear
set of less than r dimensions.

Obviously it is sufficient to prove the second part of this theorem,
since the first part then follows as a corollary. We recall that, by
3.4, a linear set of r dimensions in $R_n$ is defined by n - r independent
linear relations between the coordinates.

Suppose first that we are given a distribution of rank r < n. The
quadratic form of matrix $\mathbf{\Lambda}$

(22.5.1)  $Q(t) = \sum_{i,k}\lambda_{ik}\,t_i t_k = E\left(\sum_i t_i(\xi_i - m_i)\right)^2$

is then of rank r, and accordingly (cf 11.10) there are exactly n - r
linearly independent vectors $t_p = (t_1^{(p)}, \ldots, t_n^{(p)})$ such that $Q(t_p) = 0$. For
each vector $t_p$, (22.5.1) shows that the relation

(22.5.2)  $\sum_i t_i^{(p)}(\xi_i - m_i) = 0$

must be satisfied with the probability 1. The n - r relations
corresponding to the n - r vectors $t_p$ then determine a linear set $L_r$
containing the total mass of the distribution, and since any vector t



such that Q(t) = 0 must be a linear combination of the $t_p$, there can
be no linear set of lower dimensionality with the same property.

Conversely, if it is known that the total mass of the distribution
belongs to a linear set $L_r$, but not to any linear set of lower
dimensionality, it is in the first place obvious that $L_r$ passes through the
centre of gravity m, so that each of the n - r independent relations
that define $L_r$ must be of the form (22.5.2). The corresponding set of
coefficients then by (22.5.1) defines a vector $t_p$ such that $Q(t_p) = 0$,
and since there are exactly n - r independent relations of this kind,
Q(t) is by 11.10 of rank r, and our theorem is proved.

Thus for a distribution of rank r < n, there are exactly n - r
independent linear relations between the variables that are satisfied
with a probability equal to one. As an example we may consider the
case n = 3. A singular distribution in $R_3$ is of rank 2, 1 or 0,
according as the total mass is confined to a plane, a straight line or a
point, and accordingly there are 1, 2 or 3 independent linear relations
between the variables that are satisfied with a probability equal to one.
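
A singular distribution in $R_3$ may be illustrated by a short Python sketch. We assume, purely for the sake of the example, that $\xi_1$ and $\xi_2$ are uncorrelated with unit variances and that $\xi_3 = \xi_1 + \xi_2$, so that the total mass lies in a plane. The helpers `rank` and `quadratic_form` are hypothetical names; the rank is found by exact Gaussian elimination.

```python
from fractions import Fraction


def rank(matrix):
    """Rank of a matrix by Gaussian elimination with exact fractions."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def quadratic_form(lam, t):
    """Q(t) = sum over i, k of lambda_ik * t_i * t_k, cf (22.5.1)."""
    n = len(t)
    return sum(lam[i][k] * t[i] * t[k] for i in range(n) for k in range(n))


# Moment matrix of (xi_1, xi_2, xi_1 + xi_2) under the assumptions stated above.
lam = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 2],
]
t = (1, 1, -1)  # coefficients of the relation xi_1 + xi_2 - xi_3 = 0
```

The moment matrix is of rank 2, and the vector t = (1, 1, -1) makes the quadratic form vanish, in agreement with the single linear relation satisfied with probability one.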

22.6. Linear transformation of variables. - Let $\xi_1, \ldots, \xi_n$ be
random variables with a given distribution in $R_n$, such that m = 0.
Consider a linear transformation

(22.6.1)  $\eta_i = \sum_{k=1}^{n} c_{ik}\,\xi_k, \quad (i = 1, 2, \ldots, m),$

with the matrix $C = C_{mn} = \{c_{ik}\}$, where m is not necessarily equal
to n. In matrix notation (cf 11.3), the transformation (22.6.1) is
simply y = Cx. This transformation defines a new random variable
$y = (\eta_1, \ldots, \eta_m)$ with an m-dimensional distribution uniquely defined by
the given n-dimensional distribution of x (cf 14.5 and 22.2).

Obviously every $\eta_i$ has the mean value zero. Writing $\lambda_{ik} = E(\xi_i\xi_k)$,
$\mu_{ik} = E(\eta_i\eta_k)$, we further obtain from (22.6.1)

    $\mu_{ik} = \sum_{r,s=1}^{n} c_{ir}\,\lambda_{rs}\,c_{ks}.$

This holds even when the centre of gravity $m \ne 0$, and shows that the moment
matrices $\mathbf{\Lambda} = \mathbf{\Lambda}_{nn} = \{\lambda_{ik}\}$ and $M = M_{mm} = \{\mu_{ik}\}$ satisfy the relation

(22.6.2)  $M = C\,\mathbf{\Lambda}\,C'.$
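
The relation (22.6.2) can be illustrated numerically. In the following Python sketch the moment matrix and the transformation matrix C (with m = 2, n = 3) are hypothetical values chosen for the example, and the helper names are assumptions; the product $C\,\mathbf{\Lambda}\,C'$ is compared with the element-wise sum $\sum_{r,s} c_{ir}\lambda_{rs}c_{ks}$.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(a[i][r] * b[r][k] for r in range(len(b))) for k in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    """Transposed matrix."""
    return [list(col) for col in zip(*a)]


def transformed_moments(c, lam):
    """M = C Lambda C', cf (22.6.2)."""
    return matmul(matmul(c, lam), transpose(c))


# Hypothetical moment matrix of (xi_1, xi_2, xi_3); illustration only.
lam = [
    [2, 1, 0],
    [1, 2, 1],
    [0, 1, 2],
]
# eta_1 = xi_1 + xi_2, eta_2 = xi_2 - xi_3
c = [
    [1, 1, 0],
    [0, 1, -1],
]
m_matrix = transformed_moments(c, lam)
```

Thus, e. g., the variance of $\eta_1 = \xi_1 + \xi_2$ is found as $\lambda_{11} + 2\lambda_{12} + \lambda_{22} = 6$.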

If, in the c. f. $\varphi(t)$ of the variable x, we replace $t_1, \ldots, t_n$ by new



variables $u_1, \ldots, u_m$ by means of the contragredient transformation
(cf 11.7.5) $t = C'u$, we have by (11.7.6) $t'x = u'y$, and thus

(22.6.3)  $\varphi(t) = E(e^{i t'x}) = E(e^{i u'y}) = \psi(u),$

where $\psi(u) = \psi(u_1, \ldots, u_m)$ is the c. f. of the new variable y.

From (22.6.2) we infer, by means of the properties of the rank of
a product matrix (cf 11.6), that the rank of the y-distribution never
exceeds the rank of the x-distribution.

Consider now the particular case m = n, and suppose that the
transformation matrix $C = C_{nn}$ is non-singular. Then by 11.6 the
matrices $\mathbf{\Lambda}$ and M have the same rank, so that in this case the
transformation (22.6.1) does not affect the rank of the distribution. -
Let us, in particular, choose for C an orthogonal matrix such that
the transformed matrix M is a diagonal matrix (cf 11.9). This
implies $\mu_{ik} = 0$ for $i \ne k$, so that $\eta_1, \ldots, \eta_n$ are uncorrelated variables
(cf the discussion of the case n = 2 in 21.8). In this case, the
reciprocal matrix $C^{-1}$ exists (cf 11.7), and the reciprocal transformation
$x = C^{-1}y$ shows that the $\xi_i$ may be expressed as linear functions of
the $\eta_i$. If the x-distribution is of rank r, the diagonal matrix M
contains exactly r positive diagonal elements, while all other elements
of M are zeros. If r < n, we can always suppose the $\eta_i$ so arranged
that the positive elements are $\mu_{11}, \ldots, \mu_{rr}$. For $i = r + 1, \ldots, n$, we
then have $E(\eta_i^2) = 0$, which shows that $\eta_i$ is almost always equal
to zero. Thus we have the following generalization of 21.8:

If the distribution of n variables $\xi_1, \ldots, \xi_n$ is of rank r, the $\xi_i$ may
with a probability equal to 1 be expressed as linear functions of r
uncorrelated variables $\eta_1, \ldots, \eta_r$.

The concept of convergence in probability (cf 20.3) immediately
extends itself to multi-dimensional variables. A variable $x = (\xi_1, \ldots, \xi_n)$
is said to converge in probability to the constant vector $a = (a_1, \ldots, a_n)$
if $\xi_i$ converges in probability to $a_i$ for $i = 1, \ldots, n$. We shall require
the following analogue of the convergence theorem of 20.6, which
may be proved by a straightforward generalization of the proof for
the one-dimensional case:

Suppose that we have for every $\nu = 1, 2, \ldots$

    $y_\nu = A x_\nu + z_\nu,$

where $x_\nu$, $y_\nu$ and $z_\nu$ are n-dimensional random variables, while A is a
matrix of order $n \cdot n$ with constant elements. Suppose further that, as



$\nu \to \infty$, the n-dimensional distribution of $x_\nu$ tends to a certain limiting
distribution, while $z_\nu$ converges in probability to zero. Then $y_\nu$ has the
limiting distribution defined by the linear transformation y = Ax, where
x has the limiting distribution of the $x_\nu$.

22.7. The ellipsoid of concentration. - The definition of the ellipse
of concentration given in 21.10 may be generalized to any number
of dimensions. Let the variables $\xi_1, \ldots, \xi_n$ have a non-singular
distribution in $R_n$ with m = 0 and the second order central moments $\lambda_{ik}$,
and consider the non-negative quadratic form

    $q(\xi_1, \ldots, \xi_n) = \sum_{i,k} a_{ik}\,\xi_i\xi_k.$

If a mass unit is uniformly distributed (i. e. such that the fr. f. is
constant) over the domain bounded by the n-dimensional ellipsoid
$q = c^2$, the first order moments of this distribution will evidently be
zero, while the second order moments are according to (11.12.4)

    $\frac{c^2}{n + 2}\,\frac{A_{ik}}{A}, \quad (i, k = 1, \ldots, n),$

where $A = |a_{ik}|$ and $A_{ik}$ is the cofactor of $a_{ik}$ in A.
It is now required to determine c and the $a_{ik}$ such that these
moments coincide with the given moments $\lambda_{ik}$. It is readily seen that
this is effected by choosing, in generalization of 21.10, $c^2 = n + 2$ and
$a_{ik} = \frac{\Lambda_{ik}}{\Lambda}$. Thus the ellipsoid

(22.7.1)  $q(\xi_1, \ldots, \xi_n) = \sum_{i,k}\frac{\Lambda_{ik}}{\Lambda}\,\xi_i\xi_k = n + 2$

has the required property. This will be called the ellipsoid of
concentration corresponding to the given distribution, and will serve as a
geometrical illustration of the mode of concentration of the
distribution about the origin. The modification of the definition to be made
in the case of a general m is obvious. When two distributions with
the same centre of gravity are such that one of the concentration
ellipsoids lies wholly within the other, the former distribution will be
said to have a greater concentration than the latter.

The quadratic form q appearing in (22.7.1) is the reciprocal of
the form



    $Q(t_1, \ldots, t_n) = \sum_{i,k}\lambda_{ik}\,t_i t_k.$

(Since $\mathbf{\Lambda}$ is a symmetric matrix, we may replace $\Lambda_{ki}$ by $\Lambda_{ik}$ in the
elements of the reciprocal matrix as defined in 11.7.)

The n-dimensional volume of the ellipsoid (22.7.1) has by (11.12.3)
the expression

    $\frac{((n + 2)\pi)^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)}\sqrt{\Lambda} = \frac{((n + 2)\pi)^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)}\,\sigma_1 \cdots \sigma_n\sqrt{P},$

where the determinants $\Lambda = |\lambda_{ik}|$ and $P = |\rho_{ik}|$ are both positive,
since the distribution is non-singular. When $\sigma_1, \ldots, \sigma_n$ are given, it
follows from (22.3.4) that the volume reaches its maximum when the
variables are uncorrelated (P = 1), while on the other hand the volume
tends to zero when the $\rho_{ik}$ tend to the correlation coefficients of a
singular distribution. The ratio between the volume and its maximum
value is equal to $\sqrt{P}$; this quantity has been called the scatter
coefficient of the distribution (Frisch, Ref. 113). It may be regarded
as a measure of the degree of »non-singularity» of the distribution.
- For n = 2, we have $\sqrt{P} = \sqrt{1 - \rho^2}$.

On the other hand, the square of the volume of the ellipsoid is
proportional to the determinant $\Lambda = \sigma_1^2 \cdots \sigma_n^2\,P$, and this expression
has been called the generalized variance of the distribution (Wilks,
Ref. 232). For n = 1, $\Lambda$ reduces to the ordinary variance $\sigma^2$, and for
n = 2 we have $\Lambda = \sigma_1^2\sigma_2^2(1 - \rho^2)$.
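
The volume expression and the scatter coefficient lend themselves to a simple numerical check. The Python sketch below evaluates $((n+2)\pi)^{n/2}\sqrt{\Lambda}/\Gamma(\frac{n}{2}+1)$ for n = 1 and n = 2; the function names and the numerical values are assumptions made for the illustration only.

```python
import math


def concentration_volume(det_lambda, n):
    """Volume ((n + 2) * pi)^(n/2) / Gamma(n/2 + 1) * sqrt(Lambda) of (22.7.1)."""
    return ((n + 2) * math.pi) ** (n / 2) / math.gamma(n / 2 + 1) * math.sqrt(det_lambda)


def scatter_coefficient(det_p):
    """Ratio sqrt(P) between the volume and its maximum for given s. d.:s."""
    return math.sqrt(det_p)


# n = 1: the "ellipsoid" is the interval (-sigma*sqrt(3), sigma*sqrt(3)).
sigma = 2.0
length = concentration_volume(sigma ** 2, 1)

# n = 2: hypothetical values sigma_1 = 1, sigma_2 = 2, rho = 0.6.
s1, s2, rho = 1.0, 2.0, 0.6
generalized_variance = s1 ** 2 * s2 ** 2 * (1 - rho ** 2)
area = concentration_volume(generalized_variance, 2)
```

For n = 1 the result is the length $2\sigma\sqrt{3}$ of the interval over which a uniform distribution has the variance $\sigma^2$, and for n = 2 the area is $4\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}$.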

We finally remark that the identity between the homothetic families
generated by the ellipses of concentration and of inertia, which has
been pointed out in 21.10 for the two-dimensional case, breaks down
for n > 2.


CHAPTER 23. 

Regression and Correlation in n Variables. 

23.1. Regression surfaces. - The regression curves introduced in
21.5 may be generalized to any number of variables, when the
distribution belongs to one of the two simple types. Consider e. g. n
variables $\xi_1, \ldots, \xi_n$ with a distribution of the continuous type. The
conditional mean value of $\xi_1$, relative to the hypothesis $\xi_i = x_i$ for
$i = 2, \ldots, n$, is

    $E(\xi_1 \mid \xi_2 = x_2, \ldots, \xi_n = x_n) = m_1(x_2, \ldots, x_n) = \frac{\int_{-\infty}^{\infty} x_1 f(x_1, \ldots, x_n)\,dx_1}{\int_{-\infty}^{\infty} f(x_1, \ldots, x_n)\,dx_1}.$

The locus of the point $(m_1, x_2, \ldots, x_n)$ for all possible values of $x_2, \ldots, x_n$
is the regression surface for the mean of $\xi_1$, and has the equation

    $x_1 = m_1(x_2, \ldots, x_n),$

which is a straightforward generalization of (21.5.2).


23.2. Linear mean square regression. - We now consider n variables
$\xi_1, \ldots, \xi_n$ with a perfectly general distribution, such that the
second order moments are finite. In order to simplify the writing, we
shall further in this chapter always suppose m = 0. The formulae
corresponding to an arbitrary centre of gravity will then be obtained
simply by substituting $\xi_i - m_i$ for $\xi_i$ in the relations given below.

The mean square regression plane for $\xi_1$ with respect to $\xi_2, \ldots, \xi_n$
will be defined as that hyperplane

(23.2.1)  $\xi_1 = \beta_{12 \cdot 34 \ldots n}\,\xi_2 + \beta_{13 \cdot 24 \ldots n}\,\xi_3 + \cdots + \beta_{1n \cdot 23 \ldots n-1}\,\xi_n$

which gives the closest fit to the mass in the n-dimensional distribution
in the sense that the mean value

(23.2.2)  $E(\xi_1 - \beta_{12 \cdot 34 \ldots n}\,\xi_2 - \cdots - \beta_{1n \cdot 23 \ldots n-1}\,\xi_n)^2$

is as small as possible. Thus the expression on the right hand side
of (23.2.1) is the best linear estimate of $\xi_1$ in terms of $\xi_2, \ldots, \xi_n$, in the
sense of minimizing (23.2.2). We may here regard $\xi_2, \ldots, \xi_n$ as
independent variables, and $\xi_1$ as a dependent variable which is
approximately represented, or estimated, by a linear combination of the
independent variables.

In a similar way we define the m. sq. regression plane for any
other variable $\xi_i$, in which case of course $\xi_i$ takes the place of the
dependent variable, while all the remaining variables $\xi_1, \ldots, \xi_{i-1}$,
$\xi_{i+1}, \ldots, \xi_n$ are regarded as independent.

For the regression coefficients 1) $\beta$ we have here used a notation
introduced by Yule (Ref. 251). Each $\beta$ has two primary subscripts
followed by a point, and then n - 2 secondary subscripts. The first
of the primary subscripts refers to the dependent variable, and the
second to that independent variable to which the coefficient is
attached. Thus the order of the two primary subscripts is essential.
The secondary subscripts indicate, in an arbitrary order, the remaining
independent variables. - Sometimes, when no misunderstanding seems
possible, we may omit the secondary subscripts.

1) Often also called partial regression coefficients.

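In order to determine the regression coefficients, we differentiate
the expression (23.2.2) with respect to each of the n - 1 unknown
coefficients $\beta$, and then obtain the n - 1 equations

    $\lambda_{22}\beta_{12} + \lambda_{23}\beta_{13} + \cdots + \lambda_{2n}\beta_{1n} = \lambda_{21},$
    $\lambda_{32}\beta_{12} + \lambda_{33}\beta_{13} + \cdots + \lambda_{3n}\beta_{1n} = \lambda_{31},$
    $\cdots$
    $\lambda_{n2}\beta_{12} + \lambda_{n3}\beta_{13} + \cdots + \lambda_{nn}\beta_{1n} = \lambda_{n1},$

where we have omitted the secondary subscripts, thus writing $\beta_{1k}$
instead of the complete expression $\beta_{1k \cdot 23 \ldots k-1,\,k+1 \ldots n}$. The
determinant of this system of equations is $\Lambda_{11}$, the cofactor of $\lambda_{11}$ in the
determinant $\Lambda = |\lambda_{ik}|$.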

Let us first suppose that the x-distribution is non-singular (cf 22.5).
The moment matrix $\mathbf{\Lambda}$ and the correlation matrix $\mathbf{P}$ are then definite
positive, so that $\Lambda_{11} > 0$, and by (11.8.2) our equations have a unique
solution

(23.2.3)  $\beta_{1k} = -\frac{\Lambda_{1k}}{\Lambda_{11}} = -\frac{\sigma_1}{\sigma_k}\,\frac{P_{1k}}{P_{11}}.$

By simple permutation of indices we obtain the corresponding
expression

(23.2.4)  $\beta_{ik} = -\frac{\Lambda_{ik}}{\Lambda_{ii}} = -\frac{\sigma_i}{\sigma_k}\,\frac{P_{ik}}{P_{ii}}$

for the coefficient $\beta_{ik}$ in the regression plane for $\xi_i$. The omitted
secondary subscripts are here, of course, all the numbers $1, 2, \ldots, n$
with the exception of i and k, while the $\Lambda_{ik}$ and $P_{ik}$ are cofactors
in $\Lambda$ and P.

In a non-singular distribution, the regression plane for each variable
with respect to all the others is thus uniquely determined, and the
regression coefficients are given by (23.2.4). - In the particular case of n





uncorrelated variables, it follows that all regression coefficients are zero,
since we have $\Lambda_{ik} = 0$ for $i \ne k$.
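
The determination of the regression coefficients by cofactors may be sketched in Python as follows. The moment matrix is a hypothetical non-singular example with m = 0, and `det`, `cofactor` and `regression_coefficients` are helper names introduced only for this illustration; indices in the code are counted from 0. The sketch computes $\beta_{1k} = -\Lambda_{1k}/\Lambda_{11}$ and the coefficients may then be inserted in the n - 1 equations obtained by differentiation.

```python
from fractions import Fraction


def det(a):
    """Determinant by Laplace expansion along the first row (small n only)."""
    if len(a) == 1:
        return a[0][0]
    return sum(
        (-1) ** j * a[0][j] * det([row[:j] + row[j + 1:] for row in a[1:]])
        for j in range(len(a))
    )


def cofactor(a, i, k):
    """Signed cofactor of the element in row i and column k (0-based)."""
    minor = [row[:k] + row[k + 1:] for r, row in enumerate(a) if r != i]
    return (-1) ** (i + k) * det(minor)


def regression_coefficients(lam, i):
    """Return {k: beta_ik} with beta_ik = -Lambda_ik / Lambda_ii, cf (23.2.4)."""
    lam_ii = cofactor(lam, i, i)
    return {
        k: Fraction(-cofactor(lam, i, k), lam_ii)
        for k in range(len(lam))
        if k != i
    }


# Hypothetical non-singular moment matrix (illustration only).
lam = [
    [4, 2, 1],
    [2, 3, 1],
    [1, 1, 2],
]
beta = regression_coefficients(lam, 0)  # regression of xi_1 on xi_2, xi_3
```

In this example $\Lambda_{11} = 5$, $\Lambda_{12} = -3$ and $\Lambda_{13} = -1$, so that the regression plane for $\xi_1$ is $\xi_1 = \frac{3}{5}\xi_2 + \frac{1}{5}\xi_3$.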

Suppose now that the x-distribution is singular, with a rank r < n.
We then may have $\Lambda_{11} = 0$, and accordingly some regression
coefficients may be infinite or undetermined. As an example, we may
consider the case n = 3. For a distribution of rank 2, the total mass is
situated in a certain plane. As long as this plane is not parallel to
one of the axes, it is then obvious that all three regression planes
will coincide with this plane, so that all regression coefficients are
finite and uniquely determined. If, on the other hand, the plane is
parallel to one of the axes, e. g. the axis of $\xi_1$, the two-dimensional
marginal distribution of $\xi_2$ and $\xi_3$ will have its total mass confined
to a straight line. Now the moment matrix of this marginal
distribution has the determinant $\Lambda_{11}$, and thus we have $\Lambda_{11} = 0$. In this
case, we may say that the regression plane for $\xi_1$ is parallel to the
axis of $\xi_1$, so that at least one of the regression coefficients $\beta_{12 \cdot 3}$ and
$\beta_{13 \cdot 2}$ is infinite. - For a distribution of rank 1 or 0, on the other
hand, the total mass belongs to a certain straight line or to a certain
point. Each regression plane must then contain this line or point,
but is otherwise undetermined.

As in 21.6, we can show that the m. sq. regression plane (23.2.1) is
also the plane of closest fit to the regression surface $x_1 = m_1(x_2, \ldots, x_n)$,
for all distributions such that the latter exists. If it is known that
the regression surface is a plane, this plane must thus be identical
with the m. sq. regression plane.

Consider next a group of any number h < n of the variables, say
$\xi_1, \ldots, \xi_h$. The h-dimensional marginal distribution of these
variables has a moment matrix which is a certain submatrix $\mathbf{\Lambda}^*$ of $\mathbf{\Lambda}$.
We can then form the regression plane of $\xi_1$ with respect to $\xi_2, \ldots, \xi_h$,
and the regression coefficients will be given by expressions analogous
to (23.2.4), where $\Lambda_{ii}$ and $\Lambda_{ik}$ are replaced by the corresponding
cofactors from the determinant $\Lambda^* = |\mathbf{\Lambda}^*|$. - If, in particular, we
consider the group of the n - 1 variables $\xi_1, \ldots, \xi_{j-1}, \xi_{j+1}, \ldots, \xi_n$,
we obtain

(23.2.5)  $\beta_{ik} = -\frac{\Lambda_{jj \cdot ik}}{\Lambda_{jj \cdot ii}},$

where the omitted secondary subscripts are the numbers $1, 2, \ldots, n$,
with the exception of i, j and k, while $\Lambda_{jj \cdot ik}$ is the cofactor of $\lambda_{ik}$
in $\Lambda_{jj}$ (cf 11.5.3).





23.3. Residuals. - Suppose $\Lambda_{11} \ne 0$. The difference

(23.3.1)  $\eta_{1 \cdot 23 \ldots n} = \xi_1 - \beta_{12}\xi_2 - \cdots - \beta_{1n}\xi_n,$

where the regression coefficients $\beta_{1k}$ are given by (23.2.3), may be
considered as that part of the variable $\xi_1$ which remains after
subtraction of the best linear estimate of $\xi_1$ in terms of $\xi_2, \ldots, \xi_n$.
This is known as the residual of $\xi_1$ with respect to $\xi_2, \ldots, \xi_n$.

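The residual is uncorrelated with any of the »subtracted» variables
$\xi_2, \ldots, \xi_n$. We have, in fact, introducing the expressions of the $\beta$:s,

(23.3.2)  $\eta_{1 \cdot 23 \ldots n} = \frac{1}{\Lambda_{11}}\sum_{k=1}^{n}\Lambda_{1k}\,\xi_k.$

Hence $E(\eta_{1 \cdot 23 \ldots n}) = 0$, and

(23.3.3)  $E(\xi_i\,\eta_{1 \cdot 23 \ldots n}) = \frac{1}{\Lambda_{11}}\sum_{k=1}^{n}\lambda_{ik}\Lambda_{1k} = \begin{cases} \frac{\Lambda}{\Lambda_{11}} & \text{for } i = 1, \\ 0 & \text{for } i = 2, 3, \ldots, n. \end{cases}$

It follows that the residual variance $\sigma_{1 \cdot 23 \ldots n}^2 = E(\eta_{1 \cdot 23 \ldots n}^2)$ is given by

(23.3.4)  $\sigma_{1 \cdot 23 \ldots n}^2 = E(\xi_1\,\eta_{1 \cdot 23 \ldots n}) = \frac{\Lambda}{\Lambda_{11}},$

and further that the two residuals $\eta_{1 \cdot 23 \ldots n}$ and $\eta_{i \cdot jk \ldots q}$ are
uncorrelated, provided that all subscripts $i, j, k, \ldots, q$ of the latter occur
among the secondary subscripts of the former.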

The residual variance $\sigma_{1 \cdot 23 \ldots n}^2$ may, of course, be regarded as a
measure of the greatest closeness of fit that may be obtained when
we try to represent $\xi_1$ by a linear combination of $\xi_2, \ldots, \xi_n$. - In
the case n = 2, the expression (23.3.4) reduces to $\sigma_{1 \cdot 2}^2 = \sigma_1^2(1 - \rho^2)$, in
accordance with (21.7.2).
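
The properties of the residual can be verified on a small example. The following Python sketch uses a hypothetical moment matrix with m = 0 and helper names introduced only for this illustration (indices counted from 0). It forms the coefficients $\Lambda_{1k}/\Lambda_{11}$ of the residual, and the covariances $E(\xi_i\eta)$ may then be compared with $\Lambda/\Lambda_{11}$ for i = 1 and with zero for the other variables.

```python
from fractions import Fraction


def det(a):
    """Determinant by Laplace expansion along the first row (small n only)."""
    if len(a) == 1:
        return a[0][0]
    return sum(
        (-1) ** j * a[0][j] * det([row[:j] + row[j + 1:] for row in a[1:]])
        for j in range(len(a))
    )


def cofactor(a, i, k):
    """Signed cofactor of the element in row i and column k (0-based)."""
    minor = [row[:k] + row[k + 1:] for r, row in enumerate(a) if r != i]
    return (-1) ** (i + k) * det(minor)


def residual_coefficients(lam):
    """Coefficients Lambda_1k / Lambda_11 of the residual of xi_1, cf (23.3.2)."""
    lam_11 = cofactor(lam, 0, 0)
    return [Fraction(cofactor(lam, 0, k), lam_11) for k in range(len(lam))]


def covariance_with(lam, i, t):
    """E(xi_i * eta) for eta = sum of t_k * xi_k, assuming m = 0."""
    return sum(lam[i][k] * t[k] for k in range(len(t)))


# Hypothetical non-singular moment matrix (illustration only).
lam = [
    [4, 2, 1],
    [2, 3, 1],
    [1, 1, 2],
]
t = residual_coefficients(lam)
residual_variance = Fraction(det(lam), cofactor(lam, 0, 0))
```

Here $\Lambda = 13$ and $\Lambda_{11} = 5$, so that the residual variance is 13/5, against the total variance $\lambda_{11} = 4$.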

23.4. Partial correlation. - The correlation between the variables
$\xi_1$ and $\xi_2$ is measured by the correlation coefficient $\rho_{12}$, which is
sometimes also called the total correlation coefficient of $\xi_1$ and $\xi_2$. If $\xi_1$ and
$\xi_2$ are considered in conjunction with n - 2 further variables $\xi_3, \ldots, \xi_n$,
we may, however, regard the variation of $\xi_1$ and $\xi_2$ as to a certain
extent due to the variation of these other variables. Now the residuals
$\eta_{1 \cdot 34 \ldots n}$ and $\eta_{2 \cdot 34 \ldots n}$ represent, according to the preceding paragraph,
those parts of the variables $\xi_1$ and $\xi_2$ respectively, which remain after
subtraction of the best linear estimates in terms of $\xi_3, \ldots, \xi_n$. Thus
we may regard the correlation coefficient between these two
residuals as a measure of the correlation between $\xi_1$ and $\xi_2$ after
removal of any part of the variation due to the influence of $\xi_3, \ldots, \xi_n$.
This will be called the partial correlation coefficient of $\xi_1$ and $\xi_2$ with
respect to $\xi_3, \ldots, \xi_n$, and will be denoted by $\rho_{12 \cdot 34 \ldots n}$. Here the order
of the subscripts is, of course, immaterial for primary as well as for
secondary subscripts. - We thus have

(23.4.1)  $\rho_{12 \cdot 34 \ldots n} = \frac{E(\eta_{1 \cdot 34 \ldots n}\,\eta_{2 \cdot 34 \ldots n})}{\sqrt{E(\eta_{1 \cdot 34 \ldots n}^2)\,E(\eta_{2 \cdot 34 \ldots n}^2)}}.$

This expression being an ordinary correlation coefficient between two
random variables, we must have $-1 \le \rho_{12 \cdot 34 \ldots n} \le 1$.

The residuals $\eta_{1 \cdot 34 \ldots n}$ and $\eta_{2 \cdot 34 \ldots n}$ may be expressed in a form
analogous to (23.3.2), if we make use of the expression (23.2.5) for the
regression coefficients in a group of n - 1 variables. We then obtain
the two following relations analogous to (23.3.4)

    $E(\eta_{1 \cdot 34 \ldots n}^2) = E(\xi_1\,\eta_{1 \cdot 34 \ldots n}) = \frac{\Lambda_{22}}{\Lambda_{22 \cdot 11}},$

    $E(\eta_{2 \cdot 34 \ldots n}^2) = E(\xi_2\,\eta_{2 \cdot 34 \ldots n}) = \frac{\Lambda_{11}}{\Lambda_{11 \cdot 22}},$

and further

    $E(\eta_{1 \cdot 34 \ldots n}\,\eta_{2 \cdot 34 \ldots n}) = E(\xi_1\,\eta_{2 \cdot 34 \ldots n}) = \frac{1}{\Lambda_{11 \cdot 22}}\sum_{k=2}^{n}\lambda_{1k}\Lambda_{11 \cdot 2k} = -\frac{\Lambda_{12}}{\Lambda_{11 \cdot 22}}.$

Inserting these expressions in (23.4.1) we obtain the simple formula

(23.4.2)  $\rho_{12 \cdot 34 \ldots n} = -\frac{\Lambda_{12}}{\sqrt{\Lambda_{11}\Lambda_{22}}} = -\frac{P_{12}}{\sqrt{P_{11}P_{22}}}.$

By index permutation we obtain an analogous expression for the partial
correlation coefficient of any two variables $\xi_i$ and $\xi_k$, with respect to
the n - 2 remaining variables.

It is thus seen that any partial correlation coefficient may be
expressed in terms of the central moments $\lambda_{ik}$, or the total correlation
coefficients $\rho_{ik}$ of the variables concerned. Thus we obtain, e. g., in
the case n = 3

(23.4.3)  $\rho_{12 \cdot 3} = \frac{\rho_{12} - \rho_{13}\rho_{23}}{\sqrt{(1 - \rho_{13}^2)(1 - \rho_{23}^2)}}.$

In the particular case of n uncorrelated variables, it follows from
(23.4.2) that all partial correlation coefficients are, like the corresponding



total correlation coefficients, equal to zero. We thus have, e. g.,
$\rho_{12 \cdot 34 \ldots n} = \rho_{12} = 0$. As soon as there is correlation between the
variables, however, $\rho_{12 \cdot 34 \ldots n}$ is in general different from $\rho_{12}$. It is, e. g.,
easily seen from (23.4.3) that $\rho_{12}$ and $\rho_{12 \cdot 3}$ may have different signs,
and that either of these coefficients may be equal to zero, while the
other is different from zero.
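
That the total and the partial coefficient may differ in sign is readily seen from a numerical case. In the Python sketch below the correlation matrix is a hypothetical example with $\rho_{12} = 0.2$ and $\rho_{13} = \rho_{23} = 0.6$; the helper names are assumptions, and indices in the code are counted from 0. The cofactor expression (23.4.2) is compared with the explicit formula (23.4.3).

```python
import math


def det(a):
    """Determinant by Laplace expansion along the first row (small n only)."""
    if len(a) == 1:
        return a[0][0]
    return sum(
        (-1) ** j * a[0][j] * det([row[:j] + row[j + 1:] for row in a[1:]])
        for j in range(len(a))
    )


def cofactor(a, i, k):
    """Signed cofactor of the element in row i and column k (0-based)."""
    minor = [row[:k] + row[k + 1:] for r, row in enumerate(a) if r != i]
    return (-1) ** (i + k) * det(minor)


def partial_correlation(p, i, k):
    """-P_ik / sqrt(P_ii * P_kk), with respect to all remaining variables."""
    return -cofactor(p, i, k) / math.sqrt(cofactor(p, i, i) * cofactor(p, k, k))


def partial_correlation_3(r12, r13, r23):
    """Explicit expression (23.4.3) for the case n = 3."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


# Hypothetical correlation matrix (illustration only).
p = [
    [1.0, 0.2, 0.6],
    [0.2, 1.0, 0.6],
    [0.6, 0.6, 1.0],
]
rho_12_3 = partial_correlation(p, 0, 1)
```

Here $\rho_{12} = 0.2$ is positive, while $\rho_{12 \cdot 3} = (0.2 - 0.36)/0.64 = -0.25$ is negative.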

When all total correlation coefficients $\rho_{ik}$ are known, the partial
correlation coefficients may be directly calculated from (23.4.2) and
the analogous explicit expressions obtained by index permutation.
The numerical calculations may be simplified by the use of certain
recurrence relations, such as

(23.4.4)  $\rho_{12 \cdot 34 \ldots n} = \frac{\rho_{12 \cdot 34 \ldots n-1} - \rho_{1n \cdot 34 \ldots n-1}\,\rho_{2n \cdot 34 \ldots n-1}}{\sqrt{(1 - \rho_{1n \cdot 34 \ldots n-1}^2)(1 - \rho_{2n \cdot 34 \ldots n-1}^2)}}$

(cf Ex. 11, p. 319), which shows an obvious analogy to (23.4.3). By
this relation, any partial correlation coefficient may be expressed in
terms of similar coefficients, where the number of secondary subscripts
is reduced by one. Starting from the total coefficients $\rho_{ij}$ we may
thus first calculate all partial coefficients $\rho_{ij \cdot k}$ with one secondary
subscript, then the coefficients $\rho_{ij \cdot kl}$ with two secondary subscripts, etc.

Further, when the total and partial correlation coefficients are
known, any desired residual variances and partial regression coefficients
may be calculated by means of the relations (cf Ex. 12-13, p. 319)

(23.4.5)    $\sigma_{1\cdot 23\ldots n}^{2} = \sigma_{1}^{2}\,(1-\varrho_{12}^{2})(1-\varrho_{13\cdot 2}^{2})(1-\varrho_{14\cdot 23}^{2})\cdots(1-\varrho_{1n\cdot 23\ldots n-1}^{2}),$

            $\beta_{12\cdot 34\ldots n} = \varrho_{12\cdot 34\ldots n}\,\dfrac{\sigma_{1\cdot 34\ldots n}}{\sigma_{2\cdot 34\ldots n}},$

and the analogous relations obtained by index permutation. It will
be seen that these relations are direct generalizations of (21.6.9) and
(21.6.10). From the last relation we obtain

(23.4.6)    $\varrho_{12\cdot 34\ldots n}^{2} = \beta_{12\cdot 34\ldots n}\,\beta_{21\cdot 34\ldots n}.$


23.5. The multiple correlation coefficient. Consider the residual
defined by (23.3.1)

$\eta_{1\cdot 23\ldots n} = \xi_1 - \beta_{12}\xi_2 - \cdots - \beta_{1n}\xi_n = \xi_1 - \xi_1^{*},$

where $\xi_1^{*} = \beta_{12}\xi_2 + \cdots + \beta_{1n}\xi_n$ is the best linear estimate of $\xi_1$ in terms
of $\xi_2, \ldots, \xi_n$. It is easily shown that, among all linear combinations
of $\xi_2, \ldots, \xi_n$, it is $\xi_1^{*}$ that has the maximum correlation with $\xi_1$, as
measured by the ordinary correlation coefficient. The correlation
coefficient of the variables $\xi_1$ and $\xi_1^{*}$ may thus be regarded as a measure
of the correlation between $\xi_1$ on the one side, and the totality of all
variables $\xi_2, \ldots, \xi_n$ on the other. We shall call this the multiple
correlation coefficient between $\xi_1$ and $(\xi_2, \ldots, \xi_n)$, and write

(23.5.1)    $\varrho_{1(23\ldots n)} = \dfrac{E(\xi_1\xi_1^{*})}{\sqrt{E(\xi_1^{2})\,E(\xi_1^{*2})}}.$


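By (23.3.3) and (23.3.4) we have, however, writing for simplicity $\eta_1$
instead of $\eta_{1\cdot 23\ldots n}$,

$E(\xi_1\xi_1^{*}) = E(\xi_1(\xi_1 - \eta_1)) = \lambda_{11} - \dfrac{\Lambda}{\Lambda_{11}},$

$E(\xi_1^{*2}) = E(\xi_1^{2} - 2\xi_1\eta_1 + \eta_1^{2}) = \lambda_{11} - \dfrac{\Lambda}{\Lambda_{11}},$

and thus

(23.5.2)    $\varrho_{1(23\ldots n)} = \sqrt{1 - \dfrac{\Lambda}{\lambda_{11}\Lambda_{11}}} = \sqrt{1 - \dfrac{P}{P_{11}}}.$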

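By (11.10.2) we have $\Lambda \le \lambda_{11}\Lambda_{11}$, so that $E(\xi_1\xi_1^{*}) \ge 0$, and

$0 \le \varrho_{1(23\ldots n)} \le 1.$

When $\varrho_{1(23\ldots n)} = 1$, the variable $\xi_1$ is "almost certainly" equal to a
linear combination of $\xi_2, \ldots, \xi_n$. This means that the total mass of
the joint distribution of all n variables is confined to a certain
hyperplane in $R_n$, so that the distribution is singular, and we have
$\Lambda = P = 0$, in accordance with (23.5.2). On the other hand, for a
non-singular distribution it follows from the development (11.5.3) that
we have

$\varrho_{1(23\ldots n)}^{2} = \dfrac{1}{P_{11}} \sum_{i,k=2}^{n} P_{11\cdot ik}\,\varrho_{1i}\,\varrho_{1k},$

where the sum in the second member is, by 11.10, a definite positive
quadratic form in the variables $\varrho_{12}, \ldots, \varrho_{1n}$. Thus $\varrho_{1(23\ldots n)} = 0$ when
and only when $\varrho_{12} = \cdots = \varrho_{1n} = 0$, i.e. when $\xi_1$ is uncorrelated with
every $\xi_i$ for i = 2, 3, ..., n.

For the numerical calculation, it is convenient to use the relation
(cf Ex. 13, p. 319)

(23.5.3)    $\varrho_{1(23\ldots n)}^{2} = 1 - \dfrac{\sigma_{1\cdot 23\ldots n}^{2}}{\sigma_{1}^{2}} = 1 - (1-\varrho_{12}^{2})(1-\varrho_{13\cdot 2}^{2})\cdots(1-\varrho_{1n\cdot 23\ldots n-1}^{2}).$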
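As an added numerical illustration, the following Python sketch computes the square of the multiple correlation coefficient in two ways: from the determinant expression $1 - P/P_{11}$ of (23.5.2), and from the product of factors $1 - \varrho^2$ over successive partial correlation coefficients. The function names and the correlation matrix are assumptions made for the example only.

```python
from math import sqrt, isclose

def minor(a, i, j):
    """Matrix a with row i and column j removed."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(a) if r != i]

def det(a):
    """Determinant by expansion along the first row (adequate for small n)."""
    if not a:
        return 1.0
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(len(a)))

def partial(R, i, j, given=()):
    """Partial correlation coefficient by the recurrence (23.4.4)."""
    if not given:
        return R[i][j]
    *rest, k = given
    r_ij, r_ik, r_jk = (partial(R, i, j, rest), partial(R, i, k, rest),
                        partial(R, j, k, rest))
    return (r_ij - r_ik * r_jk) / sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

# Assumed correlation matrix of four variables (positive definite).
R = [[1.0, 0.5, 0.3, 0.2],
     [0.5, 1.0, 0.4, 0.1],
     [0.3, 0.4, 1.0, 0.6],
     [0.2, 0.1, 0.6, 1.0]]

# Determinant form: 1 - P / P_11, where P_11 is the cofactor of the first element.
rho2_det = 1.0 - det(R) / det(minor(R, 0, 0))

# Product form: 1 - (1 - r_12^2)(1 - r_13.2^2)(1 - r_14.23^2).
product = 1.0
given = []
for k in range(1, len(R)):
    product *= 1.0 - partial(R, 0, k, tuple(given)) ** 2
    given.append(k)
rho2_prod = 1.0 - product
print(rho2_det, rho2_prod)
```

The product form requires only the partial coefficients already obtained by the recurrence, which is presumably why it is convenient for calculation by hand.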




23.6. Orthogonal mean square regression. The orthogonal m. sq. regression
line introduced in 21.6 may be generalized to any number of variables. A
hyperplane H passing through the centre of gravity m = 0 of our n-dimensional
distribution has the equation

$\beta_1\xi_1 + \beta_2\xi_2 + \cdots + \beta_n\xi_n = 0,$

where $\beta_1, \ldots, \beta_n$ denote the generalized direction cosines of the normal to the
plane, so that $\sum \beta_i^{2} = 1$. The square of the distance between H and the point
$\mathbf{x} = (\xi_1, \ldots, \xi_n)$ is $d^{2} = (\sum \beta_i\xi_i)^{2}$. Let us try to find H such that the mean value
$E(d^{2})$ becomes as small as possible. If such a hyperplane H exists, it will be
called an orthogonal m. sq. regression plane of the distribution (cf K. Pearson,
Ref. 183 a).

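For a distribution of rank less than n, the problem is trivial, since the whole
mass belongs to a certain hyperplane H, which must then yield the value $E(d^{2}) = 0$.
We may thus suppose that the distribution is non-singular, which by 11.9 implies
that the characteristic numbers $\kappa_i$ of the moment matrix $\boldsymbol{\Lambda}$ are all positive. Let
$\kappa_0$ denote the smallest of the characteristic numbers, and let $\alpha_1, \ldots, \alpha_n$ be a
solution of the homogeneous system

$(\lambda_{11} - \kappa_0)\alpha_1 + \lambda_{12}\alpha_2 + \cdots + \lambda_{1n}\alpha_n = 0,$
$\lambda_{21}\alpha_1 + (\lambda_{22} - \kappa_0)\alpha_2 + \cdots + \lambda_{2n}\alpha_n = 0,$
$\cdots$
$\lambda_{n1}\alpha_1 + \lambda_{n2}\alpha_2 + \cdots + (\lambda_{nn} - \kappa_0)\alpha_n = 0,$

where the $\alpha_i$ are not all equal to zero. By 11.8, such a solution certainly exists,
since $|\boldsymbol{\Lambda} - \kappa_0\mathbf{I}| = 0$. Further, we may obviously suppose $\sum \alpha_i^{2} = 1$. Then the
hyperplane $H_0$ with the equation $\sum \alpha_i\xi_i = 0$ has the required properties. Let, in
fact, $d_0$ denote the distance from the point $\mathbf{x}$ to $H_0$, while d is the distance to any
other hyperplane $\sum \beta_i\xi_i = 0$. We then have, writing $z_i = \beta_i - \alpha_i$ and bearing in
mind that $\sum \alpha_i^{2} = \sum \beta_i^{2} = 1$,

$E(d^{2}) = \sum_{i,k} \lambda_{ik}(\alpha_i + z_i)(\alpha_k + z_k)$

$\quad = \sum_{i,k} \lambda_{ik}\alpha_i\alpha_k + 2\sum_{i,k} \lambda_{ik}\alpha_i z_k + \sum_{i,k} \lambda_{ik} z_i z_k$

$\quad = E(d_0^{2}) + 2\kappa_0 \sum_{i} \alpha_i z_i + \sum_{i,k} \lambda_{ik} z_i z_k$

$\quad = E(d_0^{2}) + \sum_{i,k} (\lambda_{ik} - \kappa_0\varepsilon_{ik}) z_i z_k,$

where the $\varepsilon_{ik}$ are the elements of the unit matrix $\mathbf{I}$. (The last step uses the
relation $2\sum \alpha_i z_i = -\sum z_i^{2}$, which follows from $\sum (\alpha_i + z_i)^{2} = 1$.) Since $\kappa_0$ is the
smallest characteristic number of $\boldsymbol{\Lambda}$, the matrix $\boldsymbol{\Lambda} - \kappa_0\mathbf{I}$ is by 11.10
non-negative, and thus we have $E(d^{2}) \ge E(d_0^{2})$.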

It can further be shown that, if $\kappa_0$ is a simple root of the secular equation of
$\boldsymbol{\Lambda}$, the orthogonal regression plane $H_0$ found in this way is unique, whereas if $\kappa_0$
is a multiple root, there are an infinity of planes with the required properties.

These results become intuitive, if we remember that, by (11.0.6), the reciprocal
matrix $\boldsymbol{\Lambda}^{-1}$ has the characteristic numbers $1/\kappa_i$, so that the squares of the
principal axes of the concentration ellipsoid (22.7.1) are proportional to the
numbers $\kappa_i$. This shows that the orthogonal m. sq. regression plane is orthogonal
to the smallest axis of the concentration ellipsoid, and is thus determinate or
indeterminate, according as this smallest axis is unique or not.

We can also define a straight line L of closest fit to the distribution, by the
condition that $E(\delta^{2})$ should be a minimum, where $\delta$ denotes the shortest distance
between L and a point $\mathbf{x}$. It can be shown that this line coincides with the greatest
axis of the concentration ellipsoid.
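The construction of the orthogonal regression plane may be illustrated numerically. The Python sketch below is an addition to the text and rests on several assumptions: the moment matrix is an arbitrary example with known characteristic numbers 1, 3, 3, and the direction of the smallest axis is found by power iteration on a shifted matrix, which is merely one convenient numerical device and not a method given in the text. The names `quad` and `smallest_axis` are hypothetical.

```python
from math import sqrt, isclose

def quad(L, v):
    """Quadratic form sum of L[i][k] v_i v_k, i.e. E(d^2) for the unit normal v."""
    n = len(L)
    return sum(L[i][k] * v[i] * v[k] for i in range(n) for k in range(n))

def smallest_axis(L, iters=200):
    """Unit vector belonging to the smallest characteristic number of L.

    Power iteration is applied to c*I - L, where c is the trace of L; the
    largest characteristic number of that matrix corresponds to the smallest
    one of L (assumed here to be a simple root).
    """
    n = len(L)
    c = sum(L[i][i] for i in range(n))
    v = [float(i + 1) for i in range(n)]          # arbitrary starting vector
    for _ in range(iters):
        w = [c * v[i] - sum(L[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return quad(L, v), v

# Assumed moment matrix with characteristic numbers 1, 3 and 3.
L = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, 3.0]]

kappa0, alpha = smallest_axis(L)
print(kappa0, alpha)

# Any other unit normal gives a mean square distance at least as large.
others = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.6, 0.8, 0.0], [0.0, 0.6, -0.8]]
excess = [quad(L, b) - kappa0 for b in others]
```

For this matrix the minimizing normal is proportional to (1, -1, 0), and the minimum of the mean square distance equals the smallest characteristic number.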

CHAPTER 24. 

The Normal Distribution. 

24.1. The characteristic function. As in the two-variable case
(21.11 and 21.12), we introduce first the c. f. of the normal
distribution. Let

$Q(\mathbf{t}) = Q(t_1, \ldots, t_n) = \sum_{j,k} \lambda_{jk}\,t_j t_k$

denote a non-negative quadratic form in $\mathbf{t} = (t_1, \ldots, t_n)$, while
$\mathbf{m} = (m_1, \ldots, m_n)$ is a real vector. We shall then show that the function

(24.1.1)    $\varphi(\mathbf{t}) = \varphi(t_1, \ldots, t_n) = e^{\,i\sum_j m_j t_j - \frac12 Q(t_1, \ldots, t_n)}$

is the c. f. of a certain distribution in $R_n$. This distribution will be
called a normal distribution.
Before proceeding to the proof of this statement, which will be given
in the two following paragraphs, we shall make some introductory
remarks. In matrix notation (cf 11.2 and 11.4), the expression
(24.1.1) of the c. f. may be written

(24.1.2)    $\varphi(\mathbf{t}) = e^{\,i\mathbf{m}'\mathbf{t} - \frac12 \mathbf{t}'\boldsymbol{\Lambda}\mathbf{t}}.$

The development (22.4.1) shows that the quantities $m_j$ and $\lambda_{jk}$ have
here their usual signification as mean values and second order central
moments. By (22.4.3), it further follows that any marginal distribution
of a normal distribution is itself normal.

If the moment matrix $\boldsymbol{\Lambda} = \{\lambda_{jk}\}$ is a diagonal matrix, the c. f.
(24.1.1) breaks up into a product $\varphi_1(t_1) \cdots \varphi_n(t_n)$, where each factor
is the c. f. of a one-dimensional normal distribution. Thus n uncorrelated
and normally distributed variables are always independent.

As in the two-variable case, we shall have to distinguish two cases,
according as the non-negative form Q is definite or semi-definite.
Obviously we may suppose throughout that $\mathbf{m} = 0$, since this only
involves the addition of a constant vector to the variable $\mathbf{x} = (\xi_1, \ldots, \xi_n)$.
We use the same notations for moments, correlation coefficients etc.
as in the preceding chapters.


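24.2. The non-singular normal distribution. If the quadratic
form Q is definite positive, the reciprocal form $Q^{-1}$ exists, and we
have (cf. 11.7)

$Q(\mathbf{t}) = Q(t_1, \ldots, t_n) = \sum_{j,k} \lambda_{jk}\,t_j t_k,$

$Q^{-1}(\mathbf{x}) = Q^{-1}(x_1, \ldots, x_n) = \sum_{j,k} \dfrac{\Lambda_{jk}}{\Lambda}\,x_j x_k.$

(Since the moment matrix $\boldsymbol{\Lambda}$ is symmetric, we are entitled to write
$\Lambda_{jk}$ instead of $\Lambda_{kj}$.) By (11.12.1 b) we then have

$\dfrac{1}{(2\pi)^{n/2}\sqrt{\Lambda}} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{\,i\sum_j t_j x_j - \frac12 Q^{-1}(x_1, \ldots, x_n)}\,dx_1 \ldots dx_n = e^{-\frac12 Q(t_1, \ldots, t_n)}.$

This shows that the function

(24.2.1)    $f(\mathbf{x}) = f(x_1, \ldots, x_n) = \dfrac{1}{(2\pi)^{n/2}\sqrt{\Lambda}}\; e^{-\frac{1}{2\Lambda}\sum_{j,k}\Lambda_{jk}x_j x_k}$

            $\qquad = \dfrac{1}{(2\pi)^{n/2}\,\sigma_1 \cdots \sigma_n\sqrt{P}}\; e^{-\frac{1}{2P}\sum_{j,k}P_{jk}\frac{x_j}{\sigma_j}\frac{x_k}{\sigma_k}}$

is a probability density in $R_n$, with the c. f.

(24.2.2)    $\varphi(\mathbf{t}) = e^{-\frac12 Q(t_1, \ldots, t_n)}.$

Substituting in (24.2.1) $x_j - m_j$ for $x_j$, we obtain the fr. f. of the
general non-singular normal distribution in $R_n$, the c. f. of which is
given by (24.1.1). For this distribution, the family of homothetic
ellipsoids $\sum_{j,k} \dfrac{\Lambda_{jk}}{\Lambda}(x_j - m_j)(x_k - m_k) = c^{2}$ generated by the
concentration ellipsoid (22.7.1) are equiprobability surfaces, the fr. f. being
on one of these surfaces proportional to $e^{-c^{2}/2}$ (cf Ex. 15, p. 319).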
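The expression (24.2.1) may be evaluated directly from the cofactors of the moment matrix. The Python sketch below is an added illustration; the function names and the numerical moment matrix are assumptions. For n = 2 it compares the general cofactor expression with the familiar two-dimensional normal fr. f. written in terms of the standard deviations and the correlation coefficient.

```python
from math import exp, pi, sqrt, isclose

def minor(a, i, j):
    """Matrix a with row i and column j removed."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(a) if r != i]

def det(a):
    """Determinant by expansion along the first row (adequate for small n)."""
    if not a:
        return 1.0
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(len(a)))

def cofactor(a, i, j):
    return (-1) ** (i + j) * det(minor(a, i, j))

def normal_frf(L, x):
    """Fr. f. (24.2.1) with zero means, from the cofactors of the moment matrix L."""
    n = len(L)
    d = det(L)
    q = sum(cofactor(L, j, k) * x[j] * x[k] for j in range(n) for k in range(n)) / d
    return exp(-0.5 * q) / ((2 * pi) ** (n / 2) * sqrt(d))

def bivariate_frf(x, y, s1, s2, r):
    """Two-dimensional normal fr. f. with s.d. s1, s2 and correlation r."""
    q = (x * x / s1 ** 2 - 2 * r * x * y / (s1 * s2) + y * y / s2 ** 2) / (1 - r * r)
    return exp(-0.5 * q) / (2 * pi * s1 * s2 * sqrt(1 - r * r))

# Assumed moment matrix: s.d. 2 and 1, correlation coefficient 1/2.
L = [[4.0, 1.0],
     [1.0, 1.0]]
point = (0.7, -0.4)
general = normal_frf(L, point)
special = bivariate_frf(point[0], point[1], 2.0, 1.0, 0.5)
print(general, special)
```

Since the exponent is a quadratic form, the points x and -x lie on the same ellipsoid and receive the same density, in agreement with the remark on equiprobability surfaces.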

24.3. The singular normal distribution. When the non-negative
form Q is semi-definite, no reciprocal form exists, and the expression
(24.2.1) for the fr. f. becomes indeterminate. As in the two-dimensional
case (cf 21.12) we find, however, that the function $\varphi(\mathbf{t}) = e^{-\frac12 Q(\mathbf{t})}$
may be represented as the limit of a sequence of functions of the
same type, but with definite forms $Q_\nu$. (We may, e.g., take
$Q_\nu = Q + \varepsilon_\nu \sum_j t_j^{2}$, where $\varepsilon_\nu \to 0$.) By the continuity theorem of 10.7,
it then follows that the corresponding non-singular normal distributions
tend to a limiting distribution, and that $\varphi(\mathbf{t})$ is the c. f. of
this limiting distribution, which will be called a singular normal
distribution.

If the rank of the semi-definite form Q is denoted by r, we have
r < n, and the moment matrix $\boldsymbol{\Lambda}$ of the variables $\xi_1, \ldots, \xi_n$ has the
same rank r. It then follows from 22.5 that the total mass of the
distribution is confined to a certain linear set $L_r$ of r dimensions.
Further, by 22.6 the variables $\xi_1, \ldots, \xi_n$ may with a probability equal
to 1 be expressed as linear functions of r uncorrelated variables
$\eta_1, \ldots, \eta_r$, which are themselves linear functions of the $\xi_1, \ldots, \xi_n$. Now it
will be shown in the following paragraph that any linear functions
of normally distributed variables are themselves normally distributed,
and by 24.1 we know that uncorrelated normally distributed variables
are always independent. Hence we deduce the following theorem:

If the n variables $\xi_1, \ldots, \xi_n$ are distributed in a normal distribution
of rank r, they can with a probability equal to 1 be expressed as linear
functions of r independent and normally distributed variables.
Obviously this theorem holds true also for r = n.
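A singular normal distribution of rank 2 in three dimensions may serve as a simple illustration of the theorem. In the following Python sketch, which is added here and is not part of the original text, three variables are formed as linear functions of two independent normal (0, 1) variables; the coefficient matrix `B`, the seed and the sample size are arbitrary assumptions. The moment matrix has a vanishing determinant, and every simulated point satisfies the same linear relation, so that the whole mass lies in a plane.

```python
import random

# Assumed coefficients: xi_1 = eta_1, xi_2 = eta_2, xi_3 = eta_1 + eta_2.
B = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

# Moment matrix of (xi_1, xi_2, xi_3) is B B', since the eta are
# independent with unit variances.
L = [[sum(B[i][a] * B[k][a] for a in range(2)) for k in range(3)] for i in range(3)]

def det3(m):
    """Determinant of a 3 x 3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

rng = random.Random(0)      # fixed seed, so that the run is reproducible

def draw():
    """One observation of (xi_1, xi_2, xi_3)."""
    eta = [rng.gauss(0.0, 1.0) for _ in range(2)]
    return [sum(B[i][a] * eta[a] for a in range(2)) for i in range(3)]

samples = [draw() for _ in range(1000)]
largest_deviation = max(abs(x[2] - x[0] - x[1]) for x in samples)
print(det3(L), largest_deviation)
```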


24.4. Linear transformation of normally distributed variables.
The expressions normal distribution and normally distributed variables
will in the sequel always be understood so as to include singular as
well as non-singular distributions.

Let the variable $\mathbf{x} = (\xi_1, \ldots, \xi_n)$ have a normal distribution in $R_n$,
such that $\mathbf{m} = 0$. By the linear transformation (22.6.1), we introduce
a new variable $\mathbf{y} = (\eta_1, \ldots, \eta_m)$, where m is not necessarily equal to n.
In matrix notation we then have $\mathbf{y} = \mathbf{C}\mathbf{x}$, where $\mathbf{C} = \mathbf{C}_{mn}$. Between
the moment matrices $\boldsymbol{\Lambda}$ and $\mathbf{M}$ of $\mathbf{x}$ and $\mathbf{y}$, we have by (22.6.2) the
relation $\mathbf{M} = \mathbf{C}\boldsymbol{\Lambda}\mathbf{C}'$, which holds even when $\mathbf{m} \ne 0$.

We shall now try to find the c. f. of $\mathbf{y}$. By (24.1.2), the c. f. of
$\mathbf{x}$ is in matrix notation

$\varphi(\mathbf{t}) = E(e^{\,i\mathbf{t}'\mathbf{x}}) = e^{-\frac12 \mathbf{t}'\boldsymbol{\Lambda}\mathbf{t}}.$

If we replace here $\mathbf{t}$ by a new variable $\mathbf{u}$ by means of the
contragredient substitution $\mathbf{t} = \mathbf{C}'\mathbf{u}$, we obtain according to (22.6.3) the
c. f. $\psi(\mathbf{u})$ of $\mathbf{y}$. We thus have

$\psi(\mathbf{u}) = E(e^{\,i\mathbf{u}'\mathbf{y}}) = e^{-\frac12 \mathbf{u}'\mathbf{C}\boldsymbol{\Lambda}\mathbf{C}'\mathbf{u}} = e^{-\frac12 \mathbf{u}'\mathbf{M}\mathbf{u}}.$

The last expression is, however, the c. f. of a normal distribution in
$R_m$, with the moment matrix $\mathbf{M}$. Thus any number of linear functions
of normally distributed variables are themselves normally distributed.
The remark of 24.1 that any marginal distribution of a normal
distribution is itself normal, is included as a particular case in this
proposition.
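The relation between the two characteristic functions may be verified numerically for a particular case. In the Python sketch below, added as an illustration, the matrices `Lam` and `C`, the vector `u` and all function names are assumptions chosen for the example. The c. f. of y at the point u is computed from the transformed moment matrix, and compared with the c. f. of x at the point t = C'u; a selection matrix is then used to exhibit the marginal distribution as a particular case.

```python
from math import exp, isclose

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def quad(mat, v):
    """Quadratic form v' mat v."""
    return sum(mat[i][k] * v[i] * v[k] for i in range(len(v)) for k in range(len(v)))

# Assumed moment matrix of x (n = 3) and transformation matrix C (m = 2).
Lam = [[4.0, 2.0, 0.6],
       [2.0, 2.0, 0.5],
       [0.6, 0.5, 1.0]]
C = [[1.0, 1.0, 0.0],
     [0.0, 2.0, -1.0]]

M = matmul(matmul(C, Lam), transpose(C))        # moment matrix of y = C x

u = [0.3, -0.7]
t = [sum(C[a][j] * u[a] for a in range(2)) for j in range(3)]   # t = C' u
psi_u = exp(-0.5 * quad(M, u))                  # c. f. of y at u
phi_t = exp(-0.5 * quad(Lam, t))                # c. f. of x at t = C' u

# Particular case: C selects the first and third variables (a marginal).
S = [[1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
marginal = matmul(matmul(S, Lam), transpose(S))
print(M, psi_u, phi_t, marginal)
```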

24.5. Distribution of a sum of squares. In 18.1, we have studied
the distribution of the sum $\sum_{1}^{n} \xi_i^{2}$, where the $\xi_i$ are independent
and normal (0, 1). This is the $\chi^{2}$ distribution with n degrees of
freedom, and the fr. f. of $\sum \xi_i^{2}$ is the function $k_n(x)$ defined by (18.1.3).

On a later occasion (cf 30.1-30.3), we shall require the distribution
of $\sum_{1}^{n} \xi_i^{2}$ in the more general case when $\xi_1, \ldots, \xi_n$ are normally
distributed with zero means and a moment matrix $\boldsymbol{\Lambda}$, the characteristic
numbers (cf 11.9) of which are all equal to 0 or 1. Suppose that p
of the characteristic numbers are 0, while the n - p others are 1.
Then we may find an orthogonal transformation $\mathbf{y} = \mathbf{C}\mathbf{x}$ replacing
the old variables $\mathbf{x} = (\xi_1, \ldots, \xi_n)$ by new variables $\mathbf{y} = (\eta_1, \ldots, \eta_n)$,
such that the transformed moment matrix $\mathbf{M} = \mathbf{C}\boldsymbol{\Lambda}\mathbf{C}'$ is a diagonal
matrix with its n - p first diagonal elements equal to 1, while the p
others are 0. This implies, however, that the new variables $\eta_1, \ldots, \eta_{n-p}$
are independent and normal (0, 1), while $\eta_{n-p+1}, \ldots, \eta_n$ have zero
means and zero variances, and are thus with the probability 1 equal
to zero. Hence we have with the probability 1

$\sum_{1}^{n} \xi_i^{2} = \sum_{1}^{n} \eta_i^{2} = \sum_{1}^{n-p} \eta_i^{2}.$

Thus $\sum_{1}^{n} \xi_i^{2}$ is distributed as the sum of the squares of n - p
independent variables that are normal (0, 1), i.e. $\sum_{1}^{n} \xi_i^{2}$ has the $\chi^{2}$
distribution with n - p degrees of freedom, and the fr. f. $k_{n-p}(x)$.

We finally consider the still more general case of a sequence of
variables $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots$, such that the distribution of the general term
$\mathbf{x}^{(\nu)} = (\xi_1, \ldots, \xi_n)$ tends to a normal distribution of the type considered
above. Applying 10.7 and the multi-dimensional form of (7.5.9) to the c. f. of
$\sum_{1}^{n} \xi_i^{2}$, it then follows that, in the limit, the sum of squares $\sum_{1}^{n} \xi_i^{2}$ has a
$\chi^{2}$ distribution with n - p degrees of freedom.
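A moment matrix of the type considered here arises, for instance, when each of n independent normal (0, 1) variables is measured from their common arithmetic mean. The Python sketch below is an added illustration under this assumption (n = 4, p = 1); it uses the known fact that a symmetric matrix equal to its own square has only the characteristic numbers 0 and 1, and that its trace then gives the number n - p. The seed, the number of draws and the tolerance of the final comparison are arbitrary choices, and the simulation is meant only as a rough check of the mean n - p of the resulting distribution.

```python
import random

n = 4
# Assumed moment matrix: unit matrix minus the matrix with all elements 1/n.
L = [[(1.0 if i == k else 0.0) - 1.0 / n for k in range(n)] for i in range(n)]

# L is symmetric and idempotent, so its characteristic numbers are 0 or 1.
L_squared = [[sum(L[i][j] * L[j][k] for j in range(n)) for k in range(n)]
             for i in range(n)]
idempotency_error = max(abs(L_squared[i][k] - L[i][k])
                        for i in range(n) for k in range(n))
degrees_of_freedom = sum(L[i][i] for i in range(n))   # trace = n - p

rng = random.Random(1)      # fixed seed, so that the run is reproducible

def sum_of_squares():
    """One observation of the sum of squares of variables with moment matrix L."""
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    xi = [sum(L[i][k] * z[k] for k in range(n)) for i in range(n)]  # x = L z
    return sum(x * x for x in xi)

draws = [sum_of_squares() for _ in range(20000)]
mean = sum(draws) / len(draws)
print(degrees_of_freedom, mean)
```

The variables x = Lz have the moment matrix LL' = L, so that the observed mean of the sum of squares should lie close to n - p = 3.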

24.6. Conditional distributions. Let $\xi_1, \ldots, \xi_n$ be n variables
having a non-singular normal distribution with $\mathbf{m} = 0$, the fr. f. of
which is given by (24.2.1). The conditional fr. f. of a certain number
of the variables, when the remaining variables assume prescribed
values, is given by an expression of the form (22.1.1), and it is easily
seen that in the present case this is always a non-singular normal fr. f.
We shall treat as examples the conditional distributions for one and
two variables.

One variable. The conditional fr. f. of $\xi_1$, relative to the
hypothesis $\xi_i = x_i$ for i = 2, ..., n, is by (22.1.1)

$f(x_1 \mid x_2, \ldots, x_n) = \dfrac{e^{-\frac{1}{2\Lambda}\sum_{j,k}\Lambda_{jk}x_j x_k}}{\int_{-\infty}^{\infty} e^{-\frac{1}{2\Lambda}\sum_{j,k}\Lambda_{jk}x_j x_k}\,dx_1}$

$\quad = A\,e^{-\frac{1}{2\Lambda}\left(\Lambda_{11}x_1^{2} + 2x_1\sum_{k=2}^{n}\Lambda_{1k}x_k\right)}$

$\quad = B\,e^{-\frac{\Lambda_{11}}{2\Lambda}\left(x_1 + \sum_{k=2}^{n}\frac{\Lambda_{1k}}{\Lambda_{11}}x_k\right)^{2}},$

where A and B are independent of $x_1$ but may depend on $x_2, \ldots, x_n$.
Now we know that the last expression is a fr. f. in $x_1$, and it follows
that we must have $B = \sqrt{\Lambda_{11}/(2\pi\Lambda)}$, so that the conditional distribution
of $\xi_1$ is a normal distribution with the variance $\Lambda/\Lambda_{11}$ and the mean

$m_1(x_2, \ldots, x_n) = -\dfrac{\Lambda_{12}}{\Lambda_{11}}x_2 - \cdots - \dfrac{\Lambda_{1n}}{\Lambda_{11}}x_n = \beta_{12}x_2 + \cdots + \beta_{1n}x_n,$

where the $\beta$:s are the regression coefficients given by (23.2.3). Thus
the regression is linear, and accordingly (cf 23.2) we find that the
regression surface for the mean of $\xi_1$ coincides with the m. sq.
regression plane. We further observe that the conditional variance is
independent of $x_2, \ldots, x_n$, and is equal to the residual variance
$E(\eta_{1\cdot 23\ldots n}^{2})$ as given by (23.3.4).
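The cofactor expressions for the conditional mean and variance may be compared with a direct least squares calculation. The following Python sketch is an added illustration; the moment matrix and all names are assumptions. The regression coefficients are obtained once from the cofactors, and once by solving the normal equations by Cramer's rule, while the conditional variance is compared with the residual variance of the linear estimate.

```python
from math import isclose

def minor(a, i, j):
    """Matrix a with row i and column j removed."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(a) if r != i]

def det(a):
    """Determinant by expansion along the first row (adequate for small n)."""
    if not a:
        return 1.0
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(len(a)))

def cofactor(a, i, j):
    return (-1) ** (i + j) * det(minor(a, i, j))

# Assumed non-singular moment matrix of three variables with zero means.
L = [[4.0, 2.0, 0.6],
     [2.0, 2.0, 0.5],
     [0.6, 0.5, 1.0]]
n = len(L)

# Cofactor form: coefficients -L_1k / L_11 and conditional variance L / L_11.
beta = [-cofactor(L, 0, k) / cofactor(L, 0, 0) for k in range(1, n)]
cond_var = det(L) / cofactor(L, 0, 0)

# Direct form: solve the normal equations for the best linear estimate of
# the first variable in terms of the others, by Cramer's rule.
S = minor(L, 0, 0)                          # moment matrix of the other variables
rhs = [L[j][0] for j in range(1, n)]        # their covariances with the first
beta_direct = []
for k in range(n - 1):
    S_k = [row[:k] + [rhs[r]] + row[k + 1:] for r, row in enumerate(S)]
    beta_direct.append(det(S_k) / det(S))
residual_var = L[0][0] - sum(b * c for b, c in zip(beta_direct, rhs))
print(beta, cond_var)
```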


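Two variables. The conditional fr. f. of $\xi_1$ and $\xi_2$ is

$f(x_1, x_2 \mid x_3, \ldots, x_n) = \dfrac{e^{-\frac{1}{2\Lambda}\sum_{j,k}\Lambda_{jk}x_j x_k}}{\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-\frac{1}{2\Lambda}\sum_{j,k}\Lambda_{jk}x_j x_k}\,dx_1\,dx_2}$

$\quad = C\,e^{-\frac{1}{2\Lambda}\left(\Lambda_{11}x_1^{2} + 2\Lambda_{12}x_1 x_2 + \Lambda_{22}x_2^{2}\right) + Dx_1 + Ex_2},$

where C, D and E are independent of $x_1$ and $x_2$. We now introduce
three quantities $s_1$, $s_2$ and r defined by the expressions

$s_1^{2} = \dfrac{\Lambda_{22}}{\Lambda_{11\cdot 22}}, \qquad s_2^{2} = \dfrac{\Lambda_{11}}{\Lambda_{11\cdot 22}}, \qquad r = -\dfrac{\Lambda_{12}}{\sqrt{\Lambda_{11}\Lambda_{22}}}.$

We then obtain by (11.7.3)

$\dfrac{1}{(1-r^{2})s_1^{2}} = \dfrac{\Lambda_{11}\Lambda_{22}}{\Lambda_{11}\Lambda_{22} - \Lambda_{12}^{2}} \cdot \dfrac{\Lambda_{11\cdot 22}}{\Lambda_{22}} = \dfrac{\Lambda_{11}}{\Lambda},$

and in a similar way

$\dfrac{1}{(1-r^{2})s_2^{2}} = \dfrac{\Lambda_{22}}{\Lambda}, \qquad \dfrac{r}{(1-r^{2})s_1 s_2} = -\dfrac{\Lambda_{12}}{\Lambda},$

so that

$f(x_1, x_2 \mid x_3, \ldots, x_n) = C\,e^{-\frac{1}{2(1-r^{2})}\left(\frac{x_1^{2}}{s_1^{2}} - \frac{2r x_1 x_2}{s_1 s_2} + \frac{x_2^{2}}{s_2^{2}}\right) + Dx_1 + Ex_2}.$

Comparing this with the expression of the two-dimensional normal
fr. f. given in 21.12, we find that the conditional distribution of $\xi_1$
and $\xi_2$ is a non-singular normal distribution with the conditional
variances $s_1^{2} = \Lambda_{22}/\Lambda_{11\cdot 22}$ and $s_2^{2} = \Lambda_{11}/\Lambda_{11\cdot 22}$, and the conditional
correlation coefficient $r = -\Lambda_{12}/\sqrt{\Lambda_{11}\Lambda_{22}}$. We observe that all these three
quantities are independent of $x_3, \ldots, x_n$. The variances are identical
with the variances of the residuals $\eta_{1\cdot 34\ldots n}$ and $\eta_{2\cdot 34\ldots n}$ studied in 23.4,
while the conditional correlation coefficient is identical with the
correlation coefficient of these two residuals, or the partial correlation
coefficient $\varrho_{12\cdot 34\ldots n}$ as given by (23.4.2). For the normal distribution,
the latter coefficient has thus the important property of showing not
only the correlation between the residuals but, moreover, the
correlation between $\xi_1$ and $\xi_2$ for any fixed values of $\xi_3, \ldots, \xi_n$.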
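The three relations obtained from (11.7.3) are easily checked in a numerical case. The Python sketch below is an added illustration with an assumed 4 x 4 moment matrix and hypothetical variable names; it computes the three quantities from the cofactors and verifies that the coefficients of the quadratic form in the two variables agree in both ways of writing the exponent.

```python
from math import sqrt, isclose

def minor(a, i, j):
    """Matrix a with row i and column j removed."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(a) if r != i]

def det(a):
    """Determinant by expansion along the first row (adequate for small n)."""
    if not a:
        return 1.0
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(len(a)))

def cofactor(a, i, j):
    return (-1) ** (i + j) * det(minor(a, i, j))

# Assumed non-singular moment matrix of four variables.
L = [[1.0, 0.5, 0.3, 0.2],
     [0.5, 1.0, 0.4, 0.1],
     [0.3, 0.4, 1.0, 0.6],
     [0.2, 0.1, 0.6, 1.0]]

D = det(L)
L11, L22, L12 = cofactor(L, 0, 0), cofactor(L, 1, 1), cofactor(L, 0, 1)
L1122 = det(minor(minor(L, 0, 0), 0, 0))    # first two rows and columns removed

s1_sq = L22 / L1122                          # conditional variance of the first variable
s2_sq = L11 / L1122                          # conditional variance of the second variable
r = -L12 / sqrt(L11 * L22)                   # conditional correlation coefficient

# Coefficients of x1^2, x2^2 and x1*x2 in the exponent, written both ways.
lhs = (1 / ((1 - r * r) * s1_sq),
       1 / ((1 - r * r) * s2_sq),
       r / ((1 - r * r) * sqrt(s1_sq * s2_sq)))
rhs = (L11 / D, L22 / D, -L12 / D)
print(s1_sq, s2_sq, r)
```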


24.7. Addition of independent variables. The central limit theorem.
The sum of two n-dimensional random variables $\mathbf{x} = (\xi_1, \ldots, \xi_n)$
and $\mathbf{y} = (\eta_1, \ldots, \eta_n)$ is defined as in the two-dimensional case (cf 21.11),
by writing $\mathbf{x} + \mathbf{y} = (\xi_1 + \eta_1, \ldots, \xi_n + \eta_n)$. As in 21.11, it is proved
that the c. f. of a sum of independent variables is the product of the
c. f:s of the terms.

The expression (24.1.1) for the c. f. of the normal distribution
further immediately shows that the sum of any number of normally
distributed and independent variables is itself normally distributed, as
proved for the one-dimensional case in 17.3.

In 21.11, we have considered a sum of a large number of independent
two-dimensional variables, all having the same distribution. We
have proved that, if the sum is divided by the square root of the
number of terms, the distribution of this standardized sum tends to
a certain normal distribution, as the number of terms tends to
infinity. A straightforward generalization of the proof of this theorem
shows that the theorem holds for variables in any number of
dimensions. This is the generalization to n dimensions of the
Lindeberg-Lévy theorem of 17.4, and thus forms the simplest case of the
Central Limit Theorem for variables in $R_n$. The general form of this
theorem asserts that, subject to certain conditions, the sum of a large
number of independent n-dimensional random variables is asymptotically
normally distributed. The exact conditions for the validity of the
theorem, in the general case when the terms may have unequal
distributions, are rather complicated, and we shall not go further into
the matter here. A fairly general statement will be found in Cramér,
Ref. 11, p. 113.


Exercises to Chapters 21-24.


1. $\xi$ and $\eta$ are two variables with finite second order moments. Show that
$D^{2}(\xi + \eta) = D^{2}(\xi) + D^{2}(\eta)$ when and only when the variables are uncorrelated.

2. Let $\varphi_1(t)$, $\varphi_2(t)$ and $\varphi(t)$ denote the c. f:s of $\xi$, $\eta$ and $\xi + \eta$ respectively. It
has been shown in 15.12 that $\varphi(t) = \varphi_1(t)\,\varphi_2(t)$ when $\xi$ and $\eta$ are independent.
Conversely, if we know that $\varphi(t) = \varphi_1(t)\,\varphi_2(t)$ for all t, does it follow that $\xi$ and $\eta$ are
independent? Consider the fr. f. $f(x, y) = \frac14\,[1 + xy(x^{2} - y^{2})]$, $(|x| < 1,\ |y| < 1)$, and
show by means of this example that the answer is negative.

3. Consider the expansion (21.3.2) for the c. f. of a two-dimensional distribution.
Show that, if the distribution has finite moments of all orders, this expansion may
be extended to terms of any degree in t and u. Use this expansion to show that,
for the normal distribution, any central moment $\mu_{ik}$ of even order i + k = 2n is
equal to the coefficient of $t^{i}u^{k}$ in the polynomial
$\dfrac{i!\,k!}{2^{n}\,n!}\,(\mu_{20}t^{2} + 2\mu_{11}tu + \mu_{02}u^{2})^{n}$.

4. The joint distribution of $\xi$ and $\eta$ is normal, with zero mean values and the
correlation coefficient $\varrho$. Show that the correlation coefficient of $\xi^{2}$ and $\eta^{2}$ is $\varrho^{2}$.

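5. Consider two variables $\xi$ and $\eta$ with a joint distribution of the continuous
type, and let $\varphi(t, u)$ denote the joint c. f. Using the notations of 21.4, we then have

$\left(\dfrac{\partial^{n}\varphi}{\partial u^{n}}\right)_{u=0} = i^{n}\int_{-\infty}^{\infty} e^{itx}\,dx \int_{-\infty}^{\infty} y^{n} f(x, y)\,dy = i^{n}\int_{-\infty}^{\infty} e^{itx}\,E(\eta^{n} \mid \xi = x)\,f_1(x)\,dx.$

Conversely, there is a reciprocal formula analogous to (10.3.2):

$E(\eta^{n} \mid \xi = x) = \dfrac{1}{2\pi\,i^{n} f_1(x)} \int_{-\infty}^{\infty} e^{-itx}\left(\dfrac{\partial^{n}\varphi}{\partial u^{n}}\right)_{u=0} dt,$

if the last integral is absolutely convergent. Use this result to deduce the properties
given in 21.12 of the conditional mean and the conditional variance of the normal
distribution.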

6. We use the same notations as in the preceding exercise, and suppose that $\eta$
is never negative. If the integral

$g(x) = \dfrac{1}{2\pi i}\int_{-\infty}^{\infty}\left(\dfrac{\partial\varphi}{\partial u}\right)_{u=-tx} dt$

is uniformly convergent with respect to x, it represents the fr. f. g(x) of the variable
$\xi/\eta$. (Generalization of Cramér, Ref. 11, p. 46, who gives the proof for the particular
case when $\xi$ and $\eta$ are independent.) Use this result to deduce the distributions of
18.2 and 18.3, and generalize Student's distribution to the case when the variable $\xi$
in (18.2.1) is normal (m, $\sigma$), where m $\ne$ 0 (the "non-central" t-distribution).


7. Find the necessary and sufficient conditions that three given numbers $\varrho_{12}$,
$\varrho_{13}$ and $\varrho_{23}$ may be the correlation coefficients of some three-dimensional distribution.
Find the possible values of c in the particular case when $\varrho_{12} = \varrho_{13} = \varrho_{23} = c$.

8. Each of the variables x, y and z has the mean 0 and the s. d. 1. The
variables satisfy the relation ax + by + cz = 0. Find the moment matrix $\boldsymbol{\Lambda}$, and show
that we must have $a^{4} + b^{4} + c^{4} \le 2(a^{2}b^{2} + a^{2}c^{2} + b^{2}c^{2})$.


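9. A certain random experiment may produce any of n mutually exclusive events
$E_1, \ldots, E_n$, the probability of $E_j$ being $p_j > 0$, where $\sum_{1}^{n} p_j = 1$. In a series of N
repetitions, $E_j$ occurs $\nu_j$ times, where $\sum_{1}^{n} \nu_j = N$. Show that the probability of this
result is

$\dfrac{N!}{\nu_1!\,\cdots\,\nu_n!}\;p_1^{\nu_1} \cdots p_n^{\nu_n}.$

The joint distribution of $\nu_1, \ldots, \nu_n$ defined by these probabilities is a
generalization of the binomial distribution, known as the multinomial
distribution. Show that for this distribution $m_j = E(\nu_j) = Np_j$,
$\lambda_{jj} = E(\nu_j - Np_j)^{2} = Np_j(1 - p_j)$, $\lambda_{jk} = E((\nu_j - Np_j)(\nu_k - Np_k)) = -Np_jp_k$.
For the moment matrix $\boldsymbol{\Lambda}$, we have $\Lambda = 0$ and $\Lambda_{11} = N^{n-1}p_1p_2 \cdots p_n > 0$, so that
the rank of the distribution is n - 1, in accordance with the relation
$\sum_{1}^{n} \nu_j = N$ between the variables.

Show that

$\varrho_{12} = -\sqrt{\dfrac{p_1p_2}{(1 - p_1)(1 - p_2)}}, \qquad \varrho_{12\cdot 34\ldots k} = -\sqrt{\dfrac{p_1p_2}{(1 - p_1 - p_3 - \cdots - p_k)(1 - p_2 - p_3 - \cdots - p_k)}}$

for k = 3, ..., n.

Show further that the joint c. f. of the variables $x_j = \dfrac{\nu_j - Np_j}{\sqrt{Np_j}}$ is

$\varphi(t_1, \ldots, t_n) = e^{-i\sum_j t_j\sqrt{Np_j}}\left(\sum_{j=1}^{n} p_j\,e^{\,it_j/\sqrt{Np_j}}\right)^{N}.$

As $N \to \infty$, $\varphi$ tends to the limit

$e^{-\frac12\left[\sum_j t_j^{2} - \left(\sum_j t_j\sqrt{p_j}\right)^{2}\right]}.$

This is the c. f. of a normal distribution in $R_n$. Show that this distribution is of
rank n - 1, and that the variables satisfy the relation $\sum_{1}^{n} x_j\sqrt{p_j} = 0$. Find $\varrho_{12}$ and
$\varrho_{12\cdot 34\ldots k}$.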


10. Take in the multinomial distribution $p_j = \dfrac{\lambda_j}{N}$ for j = 1, ..., n - 1, and
$p_n = 1 - \dfrac{\lambda_1 + \cdots + \lambda_{n-1}}{N}$. Investigate the limiting distribution as $N \to \infty$
(multi-dimensional Poisson distribution).

11. Show that the residual $\eta_{1\cdot 23\ldots n}$ defined by (23.3.1) may also be interpreted
as the residual of the variable $\eta_{1\cdot 23\ldots n-1}$ with respect to the single variable
$\eta_{n\cdot 23\ldots n-1}$. Show that, by means of this result, the formula (23.4.4) for the partial
correlation coefficient may be deduced from (23.4.3).


12. Use the result of the preceding exercise to prove the relation

$\sigma_{1\cdot 23\ldots n}^{2} = \sigma_{1\cdot 23\ldots n-1}^{2}\,(1 - \varrho_{1n\cdot 23\ldots n-1}^{2}).$

This shows that the representation of $\xi_1$ by means of a linear combination of
$\xi_2, \ldots, \xi_{n-1}$ will be improved by including also the further variable $\xi_n$ when and
only when $\varrho_{1n\cdot 23\ldots n-1} \ne 0$.


13. Prove the relations (23.4.6) and (23.5.3).


14. Use the continuity theorem 10.7 to prove the following proposition: If a
sequence of normal distributions in $R_n$ converges to a distribution, the limiting
distribution is normal. (Note that, in accordance with 24.4, the expression "normal
distribution" includes singular as well as non-singular distributions.)


15. The variables $\xi_1, \ldots, \xi_n$ have a non-singular normal distribution, with the
mean values $m_1, \ldots, m_n$ and the moment matrix $\boldsymbol{\Lambda}$. Use (11.12.3) and the final
remark of 24.2 to show that the variable

$\chi^{2} = \sum_{j,k=1}^{n} \dfrac{\Lambda_{jk}}{\Lambda}\,(\xi_j - m_j)(\xi_k - m_k)$

has a $\chi^{2}$-distribution with n degrees of freedom, the fr. f. being given by (18.1.3).

16. $\xi_1, \ldots, \xi_n$ are independent and normally distributed variables, all having the
same s. d. $\sigma$, while the mean values may be different. New variables $\eta_1, \ldots, \eta_n$ are
introduced by an orthogonal transformation. Show by means of 24.4 that the $\eta_i$ are
independent and normally distributed, all having the same s. d. $\sigma$ as the $\xi_i$.




THIRD PART 

STATISTICAL INFERENCE 




Chapters 25-26. Generalities. 


CHAPTER 25. 

Preliminary Notions on Sampling. 

25.1. Introductory remarks. In accordance with our general
discussion of principles in Chs 13-14, the whole theory of random
variables and probability distributions developed in Part II should be
considered as a system of mathematical propositions designed to form
a model of the statistical regularities observed in connection with
sequences of random experiments.

As already pointed out in 14.6, it will now be our task to work
out methods for testing the mathematical theory by experience, and
to show how the theory may be applied to problems of statistical
inference. These questions will form the subject matter of Part III.

Among the sets of statistical data occurring in practical applications,
we may distinguish certain general classes which, in some ways,
require different types of theoretical treatment. In the present
chapter, we shall give a few brief indications concerning some of the most
important of these classes. The following chapter will be devoted
to a preliminary survey of questions of principle connected with the
testing and applications of the theory.

25.2. Simple random sampling. Consider a random experiment $\mathfrak{E}$
connected with a one-dimensional random variable $\xi$. If we make
n independent repetitions of $\mathfrak{E}$, we shall obtain a sequence of n
observed values of the variable, say $x_1, x_2, \ldots, x_n$.

A sequence of this type, forming the result of n independent
repetitions of a certain random experiment, is representative of a simple
but fundamentally important class of statistical data. With respect
to data belonging to this class, we shall often use a current
terminology derived from certain particular fields of application, as we are
now going to explain.

Consider a random experiment $\mathfrak{E}$ of the following type: A certain
set containing a finite number of elements is given, and our
experiment consists in choosing at random an element from the set, observing
the value of some characteristic $\xi$ of the element, and then replacing
the element in the set. It is assumed that the experiment is so
arranged that the probability of being chosen is the same for all
elements. Using expressions borrowed from the statistical study of
human and other biological populations, we shall talk of the given
set as the parent population, and of its elements as members or
individuals (cf 13.3). The group of individuals observed in the course of n
repetitions of the experiment $\mathfrak{E}$ will be called a random sample from
the population, and the sampling process thus described will be denoted
as simple random sampling.

Often we are not interested in the individuals as such, but only
in the values of the variable characteristic $\xi$ and their distribution
among the members. In such cases we shall find it advantageous to
consider the parent population as composed, not of individuals, but of
values of $\xi$. A sequence of n observed values $x_1, \ldots, x_n$ will then be
conceived as a random sample from this population of $\xi$-values. Talking
from this point of view, we may replace the parent population by an
urn containing one ticket for each member of the population, with
the corresponding value of $\xi$ inscribed on it. The experiment $\mathfrak{E}$ will
then consist in drawing at random a ticket, noting the value inscribed,
and replacing the ticket in the urn.
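The urn scheme just described is easily imitated on a computer. The following Python sketch is an added illustration; the contents of the urn, the seed and the function name `simple_random_sample` are arbitrary assumptions. Each drawing gives every ticket the same probability, and the ticket is replaced before the next drawing, so that the n observed values are obtained by independent repetitions of the same experiment.

```python
import random

def simple_random_sample(urn, n, rng):
    """Draw n tickets one at a time, replacing each ticket after noting its value."""
    return [rng.choice(urn) for _ in range(n)]

# Assumed urn: one ticket per member of the population, inscribed with its value.
urn = [1.2, 1.2, 3.4, 5.0, 5.0, 5.0]

rng = random.Random(2)      # fixed seed, so that the run is reproducible
sample = simple_random_sample(urn, 10, rng)
print(sample)
```

Since the tickets are replaced, the size of the sample may exceed the number of tickets in the urn, as it does in this example.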

As there are only a finite number of tickets in the urn, the random
variable $\xi$ will only have a finite number of possible values, so that
its distribution will be of the discrete type (cf 15.2). By taking the
number N of tickets very large, this distribution may, however, be
made to approximate as closely as we please to any distribution given
in advance, and when N tends to infinity the error involved in the
approximation may be made to tend to zero. As a matter of
illustration, we may thus interpret any type of random experiment $\mathfrak{E}$ as the
random selection of an individual from an infinite parent population
(cf 13.3). We then imagine an urn containing an infinite number of
tickets, on each of which a certain number is written, in such a way
that the distribution of these numbers is identical with the distribution
of the random variable $\xi$ associated with $\mathfrak{E}$. Each performance of $\mathfrak{E}$
is now interpreted as the drawing of a ticket from this urn, and a
sequence $x_1, \ldots, x_n$ of observed values of $\xi$ is regarded as a random
sample from the infinite population of numbers inscribed on the
tickets. The values $x_1, \ldots, x_n$ will accordingly be called the sample
values.

It must be expressly observed that this extension of the idea of
sampling to the case of an infinite population should be regarded as
a mere illustration for the purpose of introducing a convenient
terminology, and should by no means be taken to imply that conceptions
such as the random selection of individuals from an infinite population
form part of our theory.

Bearing this reservation in mind we shall, however, often find it
convenient to use the sampling terminology in the extended sense
suggested above. A set of observed values of a random variable with
a certain d. f. F(x) will thus often be regarded as a random sample
from a population having the d. f. F(x) or, as we shall sometimes
briefly say, a random sample from the distribution corresponding to F(x).

Whenever in the sequel expressions such as "sample" or "sampling"
are used without further specification, it will always be understood
that we are concerned with simple random sampling.

All the above may be directly extended to the case of a random
variable in any number k of dimensions. Every individual in our
imaginary infinite population will then be characterized by a set of
k numbers, and any sequence of observed values of the k-dimensional
random variable may be interpreted as a random sample from such an
infinite k-dimensional population.

25.3. The distribution of the sample. Consider a sequence of n
observed values $x_1, \ldots, x_n$ of a one-dimensional random variable $\xi$
with the d. f. F(x). According to the preceding paragraph, we may
regard $x_1, \ldots, x_n$ as a set of sample values, "drawn" from a
population with the d. f. F(x). The sample may be geometrically represented
by the set of n points $x_1, \ldots, x_n$ on the x-axis.

The distribution of the sample will then be defined as the
distribution obtained by placing a mass equal to 1/n in each of the points
$x_1, \ldots, x_n$. This is a distribution of the discrete type, having n
discrete mass points (some of which may, of course, coincide). The
corresponding d. f., which will be denoted by F*(x), is a step function
with a step of the height 1/n in each $x_i$. If we denote by $\nu$ the
number of sample values that are $\le x$, we evidently have

(25.3.1)    $F^{*}(x) = \dfrac{\nu}{n},$

so that F*(x) represents the frequency ratio of the event $\xi \le x$ in
our sequence of n observations.
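The definition (25.3.1) translates directly into a short computation. The Python sketch below is an added illustration; the function name `sample_df` and the sample values are assumptions. The sample is sorted once, and the number of values not exceeding x is then found by bisection; exact fractions are used so that the steps of height 1/n are visible.

```python
from bisect import bisect_right
from fractions import Fraction

def sample_df(sample):
    """Return the d. f. of the sample as a function of x."""
    xs = sorted(sample)
    n = len(xs)

    def F_star(x):
        # number of sample values <= x, divided by n
        return Fraction(bisect_right(xs, x), n)

    return F_star

# Assumed sample of n = 4 values, two of which coincide.
F_star = sample_df([3, 1, 2, 2])
values = [F_star(x) for x in (0, 1, 2, 2.5, 3)]
print(values)

# The same values in another arrangement define the same function.
G_star = sample_df([2, 3, 2, 1])
```

At the point 2, where two sample values coincide, the step has the height 2/n.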


Obviously this distribution is uniquely determined by the sample.
On the other hand, two samples consisting of the same values in
different arrangements will give the same distribution. The distribution
determines, in fact, only the positions of the sample values on
the x axis, but not their mutual order in the sample.

For the distribution thus defined, with the d. f. $F^*(x)$, we may
calculate various characteristics such as moments, semi-invariants,
coefficients of skewness and excess etc., according to the general rules for
one-dimensional distributions given in Ch. 15. These characteristics
will be called the moments etc. of the sample, as distinct from the
corresponding characteristics of the distribution associated with the
random variable $\xi$ and the d. f. F(x). The latter characteristics will
also be called the moments etc. of the population.

Thus e. g. by 15.4 the $\nu$-th moment of the sample is

$$a_\nu = \int_{-\infty}^{\infty} x^\nu \, dF^*(x) = \frac{1}{n} \sum_{i=1}^{n} x_i^\nu,$$

i. e. the arithmetic mean of the $\nu$-th powers of the sample values,
while the corresponding moment of the population is

$$\alpha_\nu = \int_{-\infty}^{\infty} x^\nu \, dF(x).$$

The above definitions directly extend themselves to samples from
multi-dimensional populations. Suppose e. g. that we have a sample of
n pairs of values $(x_1, y_1), \ldots, (x_n, y_n)$ of a two-dimensional random
variable. This sample may be geometrically represented by the set of n
points $(x_1, y_1), \ldots, (x_n, y_n)$ in a plane, and the distribution of the sample
is the discrete distribution obtained by placing a mass equal to 1/n
in each of these n points. For this distribution, we may calculate
moments, coefficients of regression and correlation, and other
characteristics according to the general rules for two-dimensional
distributions given in Ch. 21. These are the moments etc. of the sample, as
distinct from the corresponding characteristics of the distribution (or
of the population). The extension to samples from populations of
more than two dimensions is obvious.

The distribution of a sample, as well as the moments and other
characteristics of such a distribution, will play an important part in
the sequel. In this connection, we shall use a particular system of
notations that will be explained in 27.1.


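As a small illustration of the definitions in 25.3, the following Python sketch builds the d. f. $F^*(x)$ of a sample according to (25.3.1) and computes sample moments as arithmetic means of powers of the sample values. The function names `empirical_df` and `sample_moment` and the numerical sample are illustrative assumptions, not part of the text.

```python
from bisect import bisect_right


def empirical_df(sample):
    """Return the step function F*(x) = nu / n of (25.3.1).

    nu is the number of sample values that are <= x, n the sample size.
    """
    ordered = sorted(sample)
    n = len(ordered)

    def f_star(x):
        # bisect_right counts how many of the ordered values are <= x.
        return bisect_right(ordered, x) / n

    return f_star


def sample_moment(sample, nu):
    """Arithmetic mean of the nu-th powers of the sample values."""
    return sum(x ** nu for x in sample) / len(sample)


# Hypothetical sample; two of the mass points coincide.
sample = [2.0, 1.0, 3.0, 2.0]
f_star = empirical_df(sample)

print([f_star(x) for x in (0.5, 1.0, 2.0, 3.0)])
print(sample_moment(sample, 1), sample_moment(sample, 2))
```

Since only the sorted values enter the computation, a rearrangement of the same sample values gives the same function, in agreement with the remark that the distribution does not determine the mutual order of the values in the sample.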


25.4. The sample values as random variables. Sampling distributions. –
In order to obtain a sample of n values of a one-dimensional
random variable with the d. f. F(x), we have to perform a
sequence of n independent repetitions of the random experiment $\mathfrak{E}$ to
which the variable is attached. This sequence of n repetitions forms
a combined experiment, bearing on n independent variables $x_1, \ldots, x_n$,
where $x_i$ is associated with the i-th repetition of $\mathfrak{E}$. The sample values
$x_1, \ldots, x_n$ that express the result of such a combined experiment thus
give rise to a combined random variable $(x_1, \ldots, x_n)$ in n dimensions,
where the $x_i$ are independent variables, all of which have the same
d. f. F(x). The values of $x_1, \ldots, x_n$ observed in an actual sample form
an observed »value» of the n-dimensional random variable $(x_1, \ldots, x_n)$.

When the sample values are thus conceived as random variables,
any function of $x_1, \ldots, x_n$ is by 14.5 a random variable with a
distribution uniquely determined by the joint distribution of the $x_i$, i. e. by
the d. f. F(x). Now any moment or other characteristic of the sample
is a certain function $g(x_1, \ldots, x_n)$ of the sample values. Consequently
any sample characteristic gives rise to a random variable with a
distribution uniquely determined by F(x).

If samples of n values are repeatedly drawn from the same population,
and if for each sample the characteristic $g(x_1, \ldots, x_n)$ is calculated,
the sequence of values obtained in this way will constitute a
sequence of observed values of the random variable $g(x_1, \ldots, x_n)$. The
probability distribution of this variable will be called the sampling
distribution of the corresponding characteristic.
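The notion of a sampling distribution can be made concrete by actually drawing repeated samples. The sketch below is only an illustration under stated assumptions: the population is taken to be uniform on (0, 1), the characteristic g is the sample mean, and pseudo-random numbers from Python's `random` module stand in for the random experiment. The name `sampling_distribution` is hypothetical.

```python
import random
import statistics


def sampling_distribution(g, draw, n, repetitions, seed=0):
    """Observed values of g(x_1, ..., x_n) over repeated samples of size n.

    draw(rng) returns one observed value of the population variable.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(repetitions):
        sample = [draw(rng) for _ in range(n)]
        values.append(g(sample))
    return values


def draw_uniform(rng):
    # Assumed population: uniform distribution on (0, 1).
    return rng.random()


means = sampling_distribution(statistics.mean, draw_uniform, n=25, repetitions=2000)

# For this population the sampling distribution of the mean has
# mean 1/2 and variance 1/(12 n) = 1/300.
print(statistics.mean(means), statistics.variance(means))
```

The list `means` is a sequence of observed values of the random variable g; its empirical mean and variance should lie close to the theoretical values of the sampling distribution quoted in the comment.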

These remarks are immediately extended to the case of samples 
from multi-dimensional populations. In the same sense as above, the 
sample values will here be conceived as random variables. Further, 
any moment, correlation coefficient or other characteristic of such a 
sample is a function of the sample values, and thus gives rise to a 
certain random variable, the distribution of which is uniquely deter¬ 
mined by the distribution of the population. This is the sampling 
distribution of the characteristic. 

Thus we may talk of the sampling distribution of the mean of a
sample, of the variance, the correlation coefficient etc. The properties
of sampling distributions of various important sample characteristics
will be studied in Chs 27-29.

25.5. Statistical image of a distribution. – As an example of the
concepts introduced in the preceding paragraph, we consider the d. f.
$F^*(x)$ of a one-dimensional sample, which by (25.3.1) is a function of
the sample values, containing a variable parameter x. As observed in
25.3, $F^*(x)$ is equal to the frequency of the event $\xi \le x$ in a sequence
of n repetitions of $\mathfrak{E}$. Now, by the definition of the d. f. F(x) of the
variable $\xi$, the event $\xi \le x$ has the probability F(x). Thus it follows
from the Bernoulli theorem, as interpreted in 20.3, that $F^*(x)$
converges in probability to F(x), as $n \to \infty$.

When n is large, it is thus practically certain that the d. f. $F^*(x)$
of the sample will be approximately equal to the d. f. F(x) of the
population. Consequently we may regard the distribution of the sample
as a kind of statistical image of the distribution of the population.
The graph $y = F^*(x)$ of the step-function $F^*(x)$ is known as the sum
polygon of the sample. For large values of n, this will thus be
expected to give a good approximation to the curve y = F(x). As an
example, we show in Fig. 25 the sum polygon for a sample of 100
mean temperatures in Stockholm for the month of June (cf. Table
30.4.2), together with the (hypothetical) normal d. f. of the corresponding
population.
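The approximation of F(x) by the d. f. of the sample may be examined numerically. The following sketch assumes a population uniform on (0, 1), so that F(x) = x on that interval, and uses the largest vertical distance between the sum polygon and the curve y = F(x) merely as a convenient summary; the text itself only asserts convergence in probability for each x. The names `sup_deviation` and `uniform_df` are illustrative.

```python
import random


def sup_deviation(sample, F):
    """Largest vertical distance between the step function F* and F."""
    xs = sorted(sample)
    n = len(xs)
    # At the i-th ordered value (counting from 0), F* jumps from i/n to (i+1)/n.
    return max(max((i + 1) / n - F(x), F(x) - i / n) for i, x in enumerate(xs))


def uniform_df(x):
    # Assumed population d. f.: uniform distribution on (0, 1).
    return min(max(x, 0.0), 1.0)


rng = random.Random(1)
deviations = {}
for n in (100, 10_000):
    sample = [rng.random() for _ in range(n)]
    deviations[n] = sup_deviation(sample, uniform_df)

print(deviations)
```

For the larger sample the deviation should be markedly smaller, which is what the expression »statistical image» is meant to convey.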

In practice, samples from continuous distributions are often grouped.
This means that we are not given the individual sample values, but
only the number of sample values falling into certain specified class
intervals. We then take every class interval as the basis of a rectangle
with the height $\nu/(nh)$, where h is the length of the interval, while $\nu$
denotes the number of sample values in the class. The figure obtained
in this way is the histogram of the sample. The area of any rectangle
in the histogram is equal to the corresponding class frequency $\nu/n$.
For large n this may be expected to be approximately equal to the
probability that an observed value of the variable will belong to the
corresponding class interval, which is identical with the integral of
the fr. f. f(x) over the interval. Thus the upper contour of the
histogram will form a statistical image of the fr. f., in the same way as
the sum polygon does so for the d. f. As an example, we show in
Fig. 26 the histogram of the sample of 12 000 breadths of beans given
in Table 30.4.3, together with the (hypothetical) fr. f. of the
corresponding population, according to the Edgeworth expansion (17.7.5).

Fig. 26. Histogram for the breadths of 12 000 beans, and frequency curve according
to Edgeworth's series. The scale on the horizontal axis refers to a conventional
numeration of the class intervals.
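The construction of the histogram can be sketched in a few lines of Python. The class limits and the sample below are hypothetical, and the convention that a class interval contains its upper but not its lower end point is an assumption made here for definiteness.

```python
def histogram_heights(sample, edges):
    """Heights nu / (n h) of the histogram rectangles.

    edges are the class limits in increasing order; a value x is counted
    in the class (left, right] when left < x <= right (assumed convention).
    """
    n = len(sample)
    heights = []
    for left, right in zip(edges, edges[1:]):
        nu = sum(1 for x in sample if left < x <= right)
        heights.append(nu / (n * (right - left)))
    return heights


# Hypothetical grouped sample with class intervals of unequal length.
sample = [0.5, 1.5, 1.7, 2.5, 3.9]
edges = [0.0, 1.0, 2.0, 4.0]

heights = histogram_heights(sample, edges)
areas = [h * (right - left) for h, left, right in zip(heights, edges, edges[1:])]
print(heights, areas)
```

Each area equals the class frequency $\nu/n$, so that the areas add up to unity whenever every sample value falls in one of the classes.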

Analogous remarks apply to the distribution of a sample in any
number of dimensions. Later on, we shall find that the same kind of
relationship also exists between the various characteristics of the
distributions of the sample and of the population. It will, in fact,
be shown in 27.3 and 27.8 that, under fairly general conditions, a
characteristic of the sample converges in probability to the
corresponding characteristic of the population, as the size of the sample
tends to infinity. In such cases, the sample characteristics may be
regarded as estimates of the corresponding population characteristics.
The systematic investigation of such estimates and their probability
distributions will, in the sequel, provide some of the most powerful
tools of statistical inference.




25.6. Biased sampling. Random sampling numbers. – When we
are concerned with a finite parent population, the idea of simple
random sampling has a precise and concrete significance. We may
always imagine an experimental arrangement satisfying the conditions
for a random selection of individuals from such a population, with
equal chances for all the individuals, even though its practical
realization may sometimes be exceedingly difficult. In practice there will
often be a bias in favour of certain individuals or groups of individuals,
and accordingly we then talk of a biased sampling. Experience shows
e. g. that such a bias is always to be expected when the selection of
individuals from a population is more or less dependent on human choice.

It does not enter into the plan of this book to give an account
of questions belonging to the technique of random sampling, such as
the arrangements by which bias may be as far as possible eliminated.
We shall only remark that in many cases it is possible to use with
advantage some of the published tables of random sampling numbers.
(Ref. 262, 263, 267.) Such a table consists of a sequence of digits
intended to represent the result of a simple random sampling from a
population consisting of the ten digits 0, 1, ..., 9. By joining two
columns of the table we may obtain a sequence of numbers formed
in the same way from the population consisting of the $10^2$ numbers
00, ..., 99, and similarly for three, four or any larger number of columns.

Suppose that we want to use such a table to draw a random sample
of 100 individuals from a population consisting of, say, 8183
members. The members are first numbered from 0000 to 8182. We then
read a sequence of four-figure numbers from the table, disregarding
numbers above 8182, and go on until we have obtained 100 numbers.
Our sample will then consist of the members corresponding to these
numbers. If the sampling is to be made without replacement (cf. 25.7),
we must also during the course of reading the numbers from the table
disregard any number that has already appeared.
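The procedure just described may be sketched as follows. In place of a printed table, the sketch assumes that the digits are supplied by Python's pseudo-random generator; the function names `four_figure_numbers` and `draw_members` are hypothetical.

```python
import random


def four_figure_numbers(rng):
    """Yield four-figure numbers 0000, ..., 9999 by joining four random digits.

    The generator rng is a stand-in for a table of random sampling numbers.
    """
    while True:
        digits = [rng.randrange(10) for _ in range(4)]
        yield digits[0] * 1000 + digits[1] * 100 + digits[2] * 10 + digits[3]


def draw_members(population_size, sample_size, rng, replacement=True):
    """Numbers of the members selected, in the order in which they are read."""
    if not 1 <= population_size <= 10_000:
        raise ValueError("four-figure numbers cover at most 10 000 members")
    if sample_size < 1:
        raise ValueError("the sample must contain at least one member")
    if not replacement and sample_size > population_size:
        raise ValueError("sample larger than population")
    chosen, seen = [], set()
    for number in four_figure_numbers(rng):
        if number >= population_size:
            continue  # disregard numbers above the last member
        if not replacement and number in seen:
            continue  # disregard numbers that have already appeared
        seen.add(number)
        chosen.append(number)
        if len(chosen) == sample_size:
            return chosen


members = draw_members(8183, 100, random.Random(0), replacement=False)
print(members[:10])
```

With `replacement=True` the second rule is dropped, so that the same member may appear more than once in the sample.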

The tables may also be used to obtain a sample of observed values
of a random variable with any given d. f. F(x). Suppose that we have
at our disposal a table of values of F(x) that enables us, for every
m-figure number r, to solve the equation $F(a_r) = r \cdot 10^{-m}$ with respect
to $a_r$. From our table of random numbers, we now read a sequence of
m-figure numbers r, and determine the sample values x such that the x
corresponding to any r falls in the interval $a_r < x \le a_{r+1}$. Thus we obtain
in this way a grouped sample: the sample values are not exactly
determined, but the process yields the number of sample values belonging
to any interval $(a_r, a_{r+1})$, and it is seen that the probability for any
sample value to fall in this interval has the correct value

$$F(a_{r+1}) - F(a_r) = 10^{-m}.$$

The larger we take m, the finer is the grouping and the more accurate
the determination of the sample values. Further discussion of the
tables of random sampling numbers and their use will be found in the
introductions to the tables and in two papers by Kendall and Babington
Smith (Ref. 137).
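A minimal sketch of this construction is given below. It assumes, purely for illustration, the d. f. $F(x) = 1 - e^{-x}$ for $x > 0$, for which the equation $F(a_r) = r \cdot 10^{-m}$ can be solved explicitly, and it again lets a pseudo-random generator play the part of the table. All function names are hypothetical.

```python
import math
import random


def exponential_df(x):
    """Assumed d. f. F(x) = 1 - exp(-x) for x > 0, and 0 otherwise."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0


def class_limit(r, m):
    """Solve F(a_r) = r * 10**(-m) for a_r; the last limit is +infinity."""
    p = r / 10 ** m
    return math.inf if p >= 1.0 else -math.log(1.0 - p)


def grouped_sample(size, m, rng):
    """Number of sample values falling in each interval (a_r, a_{r+1}]."""
    counts = {}
    for _ in range(size):
        r = rng.randrange(10 ** m)  # an m-figure random number
        counts[r] = counts.get(r, 0) + 1
    return counts


m = 2
counts = grouped_sample(1000, m, random.Random(0))
for r in sorted(counts)[:3]:
    print(r, class_limit(r, m), class_limit(r + 1, m), counts[r])
```

Every interval carries the probability $10^{-m}$ under the assumed F, so that the counts describe a grouped sample from that distribution; a larger m gives narrower intervals.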


25.7. Sampling without replacement. The representative method. –
In practice, a sample from a finite population is often taken in
such a way that a drawn individual is not replaced in the population
before the next drawing. A sequence of drawings of this type has
obviously not the character of repetitions of a random experiment
under uniform conditions, since the composition of the population
changes from one drawing to another. We talk here of sampling
without replacement, as distinct from simple random sampling, which
is a sampling with replacement. When the population is very large,
and the sample only contains a small fraction of the total population,
it is obvious that the difference between these modes of sampling is
unimportant, and in the limiting case when the population becomes infinite,
while the size of the sample remains finite, the difference disappears.
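One simple way of seeing how the two modes of sampling approach one another is to compute the probability that a sample drawn with replacement happens to contain no individual twice, in which case it could equally well have arisen without replacement. The sketch below is an illustration only; the function name and the numerical values are assumptions.

```python
def prob_all_distinct(population_size, sample_size):
    """Probability that sampling with replacement yields no repeated individual.

    After k distinct individuals have been drawn, the next drawing avoids
    all of them with probability 1 - k / N.
    """
    p = 1.0
    for k in range(sample_size):
        p *= 1 - k / population_size
    return p


for size in (100, 10_000, 1_000_000):
    print(size, prob_all_distinct(size, 10))
```

For a fixed sample size the probability tends to unity as the population grows, in accordance with the statement that the difference disappears in the limit.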

Sampling without replacement plays an important part in applied
statistics. When it is desired to obtain information as to the
characteristics of some large population, such as the inhabitants of a country,
the fir-trees of a district, a consignment of articles delivered by a
factory etc., it is often practically impossible to observe or measure
every individual in the whole population. The method generally used
in such situations is known as the representative method: a sample of
individuals is selected for observation, and it is endeavoured to make
the sample as representative as possible of the total population. The
observed characteristics of the sample are then used to form estimates
of the unknown characteristics of the total population. Usually in
such cases samples are taken without replacement. The method of
selection may be random or purposive; in the latter case we deliberately
choose the individuals entering into our sample in order to obtain a
representative sample. Often also mixed methods are used. For the
theory of the representative method, we refer to Neyman, Ref H51.
Some simple cases will be considered in 34.2 and 34.4.




CHAPTER 26. 

Statistical Inference. 

26.1. Introductory remarks. — It has been strongly emphasized 
in 13.4 that no mathematical theory deals directly with the things of 
which we have immediate experience. The mathematical theory be¬ 
longs entirely to the conceptual sphere, and deals with purely abstract 
objects. The theory is, however, designed to form a model of a certain 
group of phenomena in the physical world, and the abstract objects 
and propositions of the theory have their counterparts in certain 
observable things, and relations between things. If the model is to be 
practically useful, there must be some kind of general agreement 
between the theoretical propositions and their empirical counterparts. 
When a certain proposition has its counterpart in some directly 
observable relation, we must require that our observations should, in 
fact, show that this relation holds. If, in repeated tests, an agree¬ 
ment of this character has been found, and if we regard this agree¬ 
ment as sufficiently accurate and permanent, the theory may be ac¬ 
cepted for practical use. 

In the present chapter, we shall discuss some points that arise 
when these general principles are applied to the mathematical theory 
of probability. We shall first consider the testing of the agreement 
between theory and facts, and then proceed to give a brief survey of 
the applications of the theory for purposes of statistical inference. 

26.2. Agreement between theory and facts. Tests of significance. –
The concept of mathematical probability as defined in 13.5 has its
empirical counterpart in certain directly observable frequency ratios.
The proposition: »The probability of the event E in connection with
the random experiment $\mathfrak{E}$ is equal to P» has, by 13.5, its counterpart
in the statement denoted as the frequency interpretation of the
probability P, which runs as follows: »In a long sequence of repetitions of
$\mathfrak{E}$, it is practically certain that the frequency of E will be approximately
equal to P».

Accordingly we must require that, whenever a theoretical deduction
leads to a definite numerical value for the probability of a certain observable
event, the truth of the corresponding frequency interpretation should be
borne out by our observations.




Thus e. g. when the probability of an event is very small, we must
require that in the long run the event should occur at most in a very
small percentage of all repetitions of the corresponding experiment.
Consequently we must be able to regard it as practically certain that,
in one single performance of the experiment, the event will not occur
(cf. 13.5). Similarly, when the probability of an event differs from
unity by a very small amount, we must require that it should be
practically certain that, in one single performance of the corresponding
experiment, the event will occur.

In a great number of cases, the problem of testing the agreement
between theory and facts presents itself in the following form. We
have at our disposal a sample of n observed values of some variable,
and we want to know if this variable can be reasonably regarded as
a random variable having a probability distribution with certain given
properties. In some cases, the hypothetical distribution will be
completely specified: we may, e. g., ask if it is reasonable to suppose
that our sample has been drawn by simple random sampling from a
population having a normal distribution with $m = 0$ and $\sigma = 1$ (cf. 17.2).
In other cases, we are given a certain class of distributions, and we
ask if our sample might have been drawn from a population having
some distribution belonging to the given class.

Consider the simple case when the hypothetical distribution is
completely specified, say by means of its d. f. F(x). We then have to
test the statistical hypothesis that our sample has been drawn from a
population with this distribution.

We begin by assuming that the hypothesis to be tested is true. It
then follows from 25.5 that the d. f. $F^*(x)$ of the sample may be
expected to form an approximation to the given d. f. F(x), when n is
large. Let us define some non-negative measure of the deviation of
$F^*$ from F. This may, of course, be made in various ways, but any
deviation measure D will be some function of the sample values, and
will thus according to 25.4 have a determined sampling distribution.
By means of this sampling distribution, we may calculate the
probability $P(D > D_0)$ that the deviation D will exceed any given quantity
$D_0$. This probability may be made as small as we please by taking
$D_0$ sufficiently large. Let us choose $D_0$ such that $P(D > D_0) = \varepsilon$,
where $\varepsilon$ is so small that we are prepared to regard it as practically
certain that an event of probability $\varepsilon$ will not occur in one single
trial.

Suppose now that we are given an actual sample of n values, and
let us calculate the quantity D from these values. Then if we find a
value $D > D_0$, this means that an event of probability $\varepsilon$ has presented
itself. However, on our hypothesis such an event ought to be practically
impossible in one single trial, and thus we must come to the conclusion
that in this case our hypothesis has been disproved by experience. On
the other hand, if we find a value $D \le D_0$, we shall be willing to
accept the hypothesis as a reasonable interpretation of our data, at
least until further experience has been gained in the matter.

This is our first instance of a type of argument which is of a very
frequent occurrence in statistical inference. We shall often encounter
situations where we are concerned with some more or less complicated
hypothesis regarding the properties of the probability distributions of
certain variables, and it is required to test whether available statistical
data agree with this hypothesis or not. A first approach to the
problem is obtained by proceeding as in the simple case considered above.
If the hypothesis is true, our sample values should form a statistical
image (cf. 25.5) of the hypothetical distribution, and we accordingly
introduce some convenient measure D of the deviation of the sample
from the distribution. By means of the sampling distribution of D,
we then find a quantity $D_0$ such that $P(D > D_0) = \varepsilon$, where $\varepsilon$ is
determined as above. If, in an actual case, we find a value $D > D_0$, we
then say that the deviation is significant, and we consider the
hypothesis as disproved. On the other hand, when $D \le D_0$, the deviation
is regarded as possibly due to random fluctuations, and the data are
regarded as consistent with the hypothesis.
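The argument may be followed step by step in a small numerical sketch. The choices made here are assumptions for the sake of illustration and are not the tests developed later in the book: the hypothetical distribution is taken to be uniform on (0, 1), the deviation measure D is the largest vertical distance between $F^*$ and F, the level is $\varepsilon = 0.01$, and the sampling distribution of D is approximated by simulation rather than derived analytically.

```python
import random


def uniform_df(x):
    # Hypothetical, completely specified d. f.: uniform on (0, 1).
    return min(max(x, 0.0), 1.0)


def sup_deviation(sample, F):
    """Deviation measure D: largest vertical distance between F* and F."""
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - F(x), F(x) - i / n) for i, x in enumerate(xs))


def significance_limit(n, epsilon, repetitions=2000, seed=0):
    """Approximate D_0 with P(D > D_0) = epsilon when the hypothesis is true."""
    rng = random.Random(seed)
    simulated = sorted(
        sup_deviation([rng.random() for _ in range(n)], uniform_df)
        for _ in range(repetitions)
    )
    exceed = round(epsilon * repetitions)  # simulated values allowed above D_0
    return simulated[repetitions - exceed - 1]


def is_significant(sample, d0):
    return sup_deviation(sample, uniform_df) > d0


n = 50
d0 = significance_limit(n, epsilon=0.01)

agreeing = [(i + 0.5) / n for i in range(n)]  # evenly spread over (0, 1)
deviating = [u ** 3 for u in agreeing]        # crowded towards zero

print(d0, is_significant(agreeing, d0), is_significant(deviating, d0))
```

The first sample gives a deviation well below the limit and is regarded as consistent with the hypothesis, while the second gives a significant deviation. Another deviation measure would lead to another test of the same hypothesis.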

A test of this general character will be called a test of significance
relative to the hypothesis in question. In the simple case when the test
is concerned with the agreement between the distribution of a set of
sample values and a theoretical distribution, we talk more specifically
of a test of goodness of fit. The probability $\varepsilon$, which may be arbitrarily
fixed, is called the level of significance of the test.

In a case when our deviation measure D exceeds the significance
limit $D_0$, we thus regard the hypothesis as disproved by experience.
This is, of course, by no means equivalent to a logical disproof. Even
if the hypothesis is true, the event $D > D_0$ with the probability $\varepsilon$
may occur in an exceptional case. However, when $\varepsilon$ is sufficiently
small, we feel practically justified in disregarding this possibility.

On the other hand, the occurrence of a single value $D \le D_0$ does
not provide a proof of the truth of the hypothesis. It only shows
that, from the point of view of the particular test applied, the
agreement between theory and observations is satisfactory. Before a
statistical hypothesis can be regarded as practically established, it will
have to pass repeated tests of different kinds.

In Chs 30-31, we shall discuss various simple tests of significance,
and give numerical examples of their application. In Ch. 35,
the general foundations of tests of this character will be submitted
to a critical analysis.

26.3. Description. – In 13.4, the applications of a mathematical
theory were roughly classified under the headings Description, Analysis
and Prediction. There are, of course, no sharp distinctions between
the three classes, and the whole classification is only introduced as a
matter of convenience. We shall now briefly comment upon some
important groups of applications belonging to the three classes.

In the first place, the theory may be used for purely descriptive
purposes. When a large set of statistical data has been collected, we
are often interested in some particular properties of the phenomenon
under investigation. It is then desirable to be able to condense the
information with respect to these properties, which may be contained
in the mass of original data, in a small number of descriptive
characteristics. The ordinary characteristics of the distribution of the
sample values, such as moments, semi-invariants, coefficients of
regression and correlation etc., may generally be used with advantage
for such purposes. The use of frequency-curves for the graduation of
data, which plays an important part in the early literature of the
subject, also belongs primarily to this group of applications.

When we replace the mass of original data by a small number of
descriptive characteristics, we perform a reduction of the data according
to the terminology of R. A. Fisher (Ref. 13, 89). It is obviously
important that this reduction should be so arranged that as much as
possible of the relevant information contained in the original data is
extracted by the set of descriptive characteristics chosen. Now the
essential properties of any sample characteristic are expressed by its
sampling distribution, and thus the systematic investigation of such
distributions in Chs 27-29 will be a necessary preliminary to the
working out of useful methods of reduction.

In most cases, however, the final object of a statistical investigation
will not be of a purely descriptive nature. The descriptive
characteristics will, in fact, usually be required for some definite purpose.
We may, e. g., want to compare various sets of data with the aid of
the characteristics of each set, or we may want to form estimates of
the values of the characteristics that we expect to find in future sets
of data. In such cases, the description of the actual data forms only
a preliminary stage of the inquiry, and we are in reality concerned
with an application belonging to one of the two following classes.

26.4. Analysis. – When a mathematical theory has been tested
and approved, it may be used to provide tools for a scientific analysis 
of observational data. In the present case we may characterize this 
type of applications by saying that we are trying to argue from 
the sample to the population. We are given certain sets of statistical data, 
which are conceived to be samples from certain populations, and we 
try to use the data to learn something about the distributions of the 
populations. A great variety of problems of this class occur in sta¬ 
tistical practice. In this preliminary survey, we shall only mention 
some of the main types which, in later chapters, will be more thor¬ 
oughly discussed. 

In 26.2, we have already met with the following type of problems:
We are given a sample of observed values of a variable, and we ask
if it is reasonable to assume that the sample may have been drawn
from a distribution belonging to some given class. Are we, e. g.,
justified in saying that the errors in a certain kind of physical
measurements are normally distributed? Or that the distribution of
incomes among the citizens of a certain state follows the law of Pareto
(cf. 19.3)? In neither case the distribution of an actual sample will
coincide exactly with the hypothetical distribution, since the former
is of the discrete, and the latter of the continuous type. But are we
entitled to ascribe the deviation of the observed distribution from the
hypothetical to random fluctuations, or should we conclude that the
deviation is significant, i. e. indicative of a real difference between
the unknown distribution of the population and the hypothetical
distribution?

We have seen in 26.2 how this question may be attacked by means
of the introduction of a test of significance. We then have to calculate
a certain measure of deviation D, and in an actual case the deviation
is regarded as significant, if D exceeds a certain given value $D_0$, while
otherwise the deviation will be ascribed to random fluctuations.

In other cases, we assume that the general character of the
distributions is known from earlier experience, and we require information
as to the values of some particular characteristics of the distributions.
Suppose, e. g., that we want to compare the effects of two different
methods of treatment of the same disease, and let us assume that for
each method there is a constant probability of recovery. Are the two
probabilities different? In order to throw light upon the problem, we
collect one sample of cases for each method, and compare the two
frequencies of recovery. In general these will be different, and we
are facing the same question as in the previous case: Is the difference
due to random fluctuations, or is it significant, i. e. indicative of a
real difference between the probabilities?

Similar, though often more complicated problems arise in many
cases, e. g. in agricultural, industrial or medical statistics, when we
want to compare the effects of various methods of treatment or of
production. We are then concerned with the means or some other
characteristics of our samples, and we ask whether the differences
between the observed values of these characteristics should be ascribed
to random fluctuations or judged to be significant.

In such cases, it is often useful to begin by considering the
hypothesis that there is no difference between the effects of the methods,
so that in reality all our samples come from the same population.
(This is sometimes called the null hypothesis.) This being assumed,
it will often be possible to work out a test of significance for the
differences between the means or other characteristics in which we are
interested. If the differences exceed certain limits, they will be regarded
as significant, and we shall conclude that there is a real difference
between the methods; otherwise we shall ascribe the differences to
random fluctuations.
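For the comparison of two frequencies of recovery, the reasoning from the null hypothesis may be sketched as follows. The sketch assumes one particular way of carrying out the test, which is not necessarily the one developed in later chapters: on the null hypothesis all cases come from the same population, so that, the total number of recoveries being regarded as given, every allocation of the recoveries to the two samples is equally probable. The function name and the data are hypothetical.

```python
from math import comb


def null_hypothesis_probability(n1, r1, n2, r2):
    """Probability, on the null hypothesis, of a difference between the two
    frequencies of recovery at least as large as the observed one.

    n1, n2 are the numbers of cases and r1, r2 the numbers of recoveries.
    """
    total = r1 + r2
    # |r1/n1 - r2/n2| is compared through the integer |r1*n2 - r2*n1|.
    observed = abs(r1 * n2 - r2 * n1)
    favourable = 0
    for k in range(max(0, total - n2), min(n1, total) + 1):
        j = total - k
        if abs(k * n2 - j * n1) >= observed:
            favourable += comb(n1, k) * comb(n2, j)
    return favourable / comb(n1 + n2, total)


# Hypothetical data: recoveries among 20 cases for each method.
print(null_hypothesis_probability(20, 10, 20, 10))
print(null_hypothesis_probability(20, 18, 20, 4))
```

A probability below the chosen level of significance leads to the conclusion that the difference is significant; otherwise the difference is ascribed to random fluctuations.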

This type of applications belongs to the realm of the statistical
analysis of causes. Suppose, more generally, that we want to know
whether there exists any appreciable causal relationship between two
variables x and y that we are investigating. As a first approach to
the problem, we may then set up the null hypothesis, which in this
case implies that the variables are independent, and proceed to work
out a test of significance for this hypothesis on the general lines
indicated above. Suppose, e. g., that we are interested in tracing a
possible connection between the annual quantities x and y of two
commodities consumed in a given group of households. From a sample
of observed values of the two-dimensional variable (x, y), we may then
calculate e. g. the sample correlation coefficient r. In general this
coefficient will be different from zero, whereas on the null hypothesis
the correlation coefficient $\rho$ of the corresponding distribution is equal
to zero. Is the difference significant, or should it be ascribed to random
fluctuations? In order to answer this question, we shall have to work
out a test of significance, based on the properties of the sampling
distribution of r. If r differs significantly from zero, this may be
taken as an indication of some kind of dependence between the
variables. The converse conclusion is, however, not legitimate. Even if
the population value $\rho$ is equal to zero, the variables may be dependent
(cf. 21.7).
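That a vanishing correlation coefficient does not imply independence can be seen from a very simple discrete distribution. In the sketch below, which is an illustration with hypothetical values, equal masses are placed in five points of the parabola $y = x^2$, so that y is completely determined by x, and the correlation coefficient of this distribution is computed by the ordinary product-moment formula.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient of the distribution with mass 1/n in each (x_i, y_i)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x * x for x in xs]            # y is a function of x
linear = [2 * x + 1 for x in xs]    # exact linear relation, for comparison

print(correlation(xs, ys), correlation(xs, linear))
```

The first coefficient is zero although the variables are as far from independent as possible; the second is unity, as it must be for an exact linear relation with positive slope.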

Various tests of significance adapted to problems of the general 
character indicated above will be treated in Chs 30—31. The test of 
significance to be applied to a given problem may always be chosen 
in many different ways. It thus becomes an important problem to 
examine the principles underlying the choice of a test, to compare 
the properties of various alternative tests and, if possible, to show 
how to find the test that will be most efficient for a given purpose. 
Questions belonging to this order of ideas will be considered in Ch. 35. 

In a further type of problems of statistical analysis it is required
to use a set of sample values to form estimates of various
characteristics of the population from which the sample is supposed to be drawn,
and to form an idea of the precision of such estimates. The simplest
problem of this type is the classical problem of inverse probability:
given the frequency of an event E in a sequence of repetitions of a
random experiment, what kind of conclusions can be drawn with
respect to the unknown value of the probability p of E? It is fairly
obvious that in this case the observed frequency ratio may be taken
as an estimate of p, but will it be possible to measure the precision
of this estimate, and even to make some valid probability statement
concerning the difference between the estimate and the unknown »true
value» of p? A more complicated problem of the same character
arises in the theory of errors, where we have at our disposal a set of
measurements on quantities connected with a certain number of
unknown constants, and it is required to form estimates of the values
of these constants, and to appreciate the precision of the estimates.
Similar problems occur in connection with the method of multiple
regression, which is of great importance in many fields of application.
In certain economic problems, e. g., economic theory leads us to assume
that there exist certain linear or approximately linear relations
between variables connected with consumers' incomes, prices and quantities
of various commodities produced or consumed in a given market.
When a set of observed values of these variables are available, it is
then required to form estimates of the »elasticities» or similar quantities
that appear as coefficients in the relations between the variables.

A general form of the estimation problem may be stated in the
following way. We consider a random variable (in any number of
dimensions), the distribution of which has a known mathematical form,
but contains a certain number of unknown constant parameters. We
are given a sample of observed values of the variable, and it is required
to use the sample values to form estimates of the parameters, and to
appreciate the precision of the estimates. In general, there will be an
infinite number of different functions of the sample values that may
be used as estimates, and it will then be important to compare the
properties of various possible estimates for the same parameter, and
in particular to find the functions (if any) that yield estimates of
maximum precision. Further, when a system of estimates has been
computed, it will be natural to ask if it is possible to make some
valid probability statements concerning the deviations of the estimates
from the unknown »true values» of the parameters. Problems of this
type form the object of the theory of estimation, which will be treated
in Chs 32-34. Finally, some applications of the preceding theories
will be given in Chs 36-37.

26.5. Prediction. — The word prediction should here be understood
in a very wide sense, as related to the ability to answer questions
such as: What is going to happen under given conditions? — What
consequences are we likely to encounter if we take this or that possible
course of action? — What course of action should we take in
order to produce some given event? — Prediction, in this wide sense
of the word, is the practical aim of any form of science.

Questions of the type indicated often arise in connection with 
random variables. We shall quote some examples : 

What numbers of marriages, births and deaths are we likely to
find in a given country during the next year? — What distribution
of colours should we expect in the offspring of a pair of mice of
known genetical constitution? — What effects are likely to occur, if
the price of a certain commodity is raised or lowered by a given
amount? — Given the results of certain routine tests on a sample
from a batch of manufactured articles, should the batch be a) destroyed,
or b) placed on the market under a guarantee? — How should the
premiums and funds of an insurance office be calculated in order to
produce a stable business? — What margin of security should be
applied in the planning of a new telephone exchange in order to
reduce the risk of a temporary overloading within reasonable limits?

If we suppose that we know the probability distributions of the
variables that enter into a question of this type, it will be seen that
we shall often be in a position to give at least a tentative answer to
the question. A full discussion of a question of this type, however,
usually requires an intimate knowledge of the particular field of
application concerned. In a work on general statistical theory, such
as the present one, it is obviously not possible to enter upon such
discussions.


Chapters 27-29. Sampling Distributions. 


CHAPTER 27. 

Characteristics of Sampling Distributions. 

27.1. Notations. — Consider a one-dimensional random variable $\xi$
with the d. f. $F(x)$. For the moments and other characteristics of the
distribution of $\xi$ we shall use the notations introduced in Ch. 15.
Thus $m$ and $\sigma^2$ denote the mean and the variance of the variable, while
$\alpha_\nu$, $\mu_\nu$ and $\kappa_\nu$ denote respectively the moment, central moment and
semi-invariant of order $\nu$. We shall suppose throughout, and without
further notice, that these quantities are finite, as far as they are
required for the deduction of our formulae.

By $n$ repetitions of the random experiment to which the variable $\xi$
is attached, we obtain a sequence of $n$ observed values of the variable:
$x_1, x_2, \ldots, x_n$. As explained in 25.2, we shall in this connection use
a terminology derived from the process of simple random sampling,
thus regarding the set of values $x_1, \ldots, x_n$ as a sample from a population
specified by the d. f. $F(x)$. The distribution of the sample is obtained
(cf 25.3) by placing a mass equal to $1/n$ in each point $x_i$, and
the moments and other characteristics of the sample are defined as the
characteristics of this distribution.

In all investigations dealing with sample characteristics, it is most 
important to use a clear and consistent system of notations. In this 
respect, we shall as far as possible apply the following three rules 
throughout the rest of the book: 

1. The arithmetic mean of any number of quantities such as $x_1, \ldots, x_n$
or $y_1, \ldots, y_k$ will be denoted by the corresponding letter with a bar: $\bar x$ or $\bar y$.

2. When a certain characteristic of the population (i. e. of the distribution
of the variable $\xi$) is ordinarily denoted by a Greek letter, the
corresponding characteristic of the sample will be denoted by the corresponding
italic letter: $s^2$ for $\sigma^2$, $a_\nu$ for $\alpha_\nu$, etc.

3. In cases not covered by the two preceding rules we shall usually
denote sample characteristics by placing an asterisk on the letter denoting
the corresponding population characteristic, thus writing e. g. $F^*(x)$ for
the d. f. of the sample, which corresponds to the population d. f. $F(x)$.

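Thus the mean and the variance of the sample are (cf 25.3)

$$\bar x = \frac{1}{n}\sum_i x_i, \qquad s^2 = \frac{1}{n}\sum_i (x_i - \bar x)^2, \tag{27.1.1}$$

where the summation is extended over all sample values: $i = 1, 2, \ldots, n$.
The moments $a_\nu$ and the central moments $m_\nu$ of the sample are

$$a_\nu = \frac{1}{n}\sum_i x_i^\nu, \qquad m_\nu = \frac{1}{n}\sum_i (x_i - \bar x)^\nu. \tag{27.1.2}$$

The coefficients of skewness and excess of the sample are, in accordance
with (15.8.1) and (15.8.2),

$$g_1 = \frac{m_3}{m_2^{3/2}}, \qquad g_2 = \frac{m_4}{m_2^2} - 3. \tag{27.1.3}$$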

The relations (15.4.4) between the moments and the central moments
hold true for any distribution; thus in particular they remain valid
if $m$, $\alpha_\nu$ and $\mu_\nu$ are replaced by the corresponding sample characteristics
$\bar x$, $a_\nu$ and $m_\nu$.
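
The definitions (27.1.1)-(27.1.3) can be illustrated by a short computational sketch. The Python fragment below is an added illustration, not part of the original exposition; the function names `raw_moment`, `central_moment` and `skewness_and_excess` are hypothetical. It uses the divisor $n$ throughout, in accordance with the convention that the sample distribution places the mass $1/n$ in each point.

```python
def raw_moment(xs, nu):
    """Sample moment a_nu = (1/n) * sum(x_i ** nu), cf. (27.1.2)."""
    return sum(x ** nu for x in xs) / len(xs)


def central_moment(xs, nu):
    """Central sample moment m_nu = (1/n) * sum((x_i - xbar) ** nu)."""
    xbar = raw_moment(xs, 1)
    return sum((x - xbar) ** nu for x in xs) / len(xs)


def skewness_and_excess(xs):
    """Coefficients g_1 = m_3 / m_2**1.5 and g_2 = m_4 / m_2**2 - 3, cf. (27.1.3)."""
    m2 = central_moment(xs, 2)
    if m2 == 0:
        raise ValueError("all sample values coincide; g_1 and g_2 are undefined")
    g1 = central_moment(xs, 3) / m2 ** 1.5
    g2 = central_moment(xs, 4) / m2 ** 2 - 3
    return g1, g2


sample = [1.0, 2.0, 3.0, 4.0]
xbar = raw_moment(sample, 1)
s2 = central_moment(sample, 2)
# The relation m_2 = a_2 - a_1**2 between moments and central moments
# holds for the sample distribution as for any other distribution.
assert abs(s2 - (raw_moment(sample, 2) - xbar ** 2)) < 1e-12
```

The final assertion checks one of the relations between moments and central moments on the sample itself, which is the point made in the preceding paragraph.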

For the d. f. of the sample, we have already in (25.3.1) introduced
the notation $F^*(x)$. Similarly the c. f. of the sample is 1)

$$\varphi^*(t) = \int_{-\infty}^{\infty} e^{itx}\,dF^*(x) = \frac{1}{n}\sum_{\nu=1}^{n} e^{itx_\nu}, \tag{27.1.4}$$

and the semi-invariants of the sample are thus according to (15.10.2)
defined by the development 2)

$$\log \varphi^*(t) = \sum_{\nu=1}^{\infty} \frac{k_\nu}{\nu!}\,(it)^\nu. \tag{27.1.5}$$

All moments and semi-invariants of the sample are finite, and the
relations (15.10.3)-(15.10.5) between moments and semi-invariants
hold true when the population characteristics are replaced by sample
characteristics.

1) When there is a possibility of confusion, we shall use a heavy-faced i to denote
the imaginary unit.

2) At this point our notation differs from the notation of R. A. Fisher (Ref. 13),
who uses the symbol $k_\nu$ to denote the unbiased estimate of $\kappa_\nu$ which, in our notation,
is denoted by $K_\nu$ (cf 27.6).



The same rules will be applied to samples from multi-dimensional
populations. Thus e. g. if we are given $n$ pairs of observed values
$(x_1, y_1), \ldots, (x_n, y_n)$ from a two-dimensional distribution, we write (cf
21.2)

$$\begin{aligned}
\bar x &= \frac{1}{n}\sum_i x_i, \qquad \bar y = \frac{1}{n}\sum_i y_i,\\
m_{20} &= s_1^2 = \frac{1}{n}\sum_i (x_i - \bar x)^2,\\
m_{11} &= r s_1 s_2 = \frac{1}{n}\sum_i (x_i - \bar x)(y_i - \bar y),\\
m_{02} &= s_2^2 = \frac{1}{n}\sum_i (y_i - \bar y)^2.
\end{aligned} \tag{27.1.6}$$

In particular, the quantity $r$ defined by the relation

$$r = \frac{m_{11}}{\sqrt{m_{20}\,m_{02}}} = \frac{m_{11}}{s_1 s_2} \tag{27.1.7}$$

is the correlation coefficient of the sample, which corresponds to the
correlation coefficient $\varrho$ of the population. Since $r$ is the correlation
coefficient of an actual distribution (viz. the distribution of the sample),
it follows from 21.2 that we have $-1 \le r \le 1$. The extreme values
$r = \pm 1$ can only occur when all the sample points $(x_i, y_i)$ are situated
on a single straight line.

For a sample in more than two dimensions, we use notations derived
according to the above rules from the notations introduced in
Chs 22-23. Thus e. g. we denote by $s_i$ the s. d. of the sample values
of the $i$:th variable, while $r_{ij}$ is the correlation coefficient between
the sample values of the $i$:th and the $j$:th variable. We further write
$R$ for the determinant $|r_{ij}|$, and denote the regression coefficients,
the partial correlation coefficients etc. of the sample by symbols such
as (cf 23.2.3 and 23.4.2)

$$b_{12\cdot 34\ldots k} = -\frac{s_1}{s_2}\,\frac{R_{12}}{R_{11}}, \qquad
r_{12\cdot 34\ldots k} = -\frac{R_{12}}{\sqrt{R_{11}R_{22}}},$$

where $k$ is the number of dimensions, while the $R_{ij}$ are the cofactors
of $R$. As before, all relations between the characteristics deduced in
Part II hold true when the population characteristics are replaced by
sample characteristics.

We now come back for one moment to the one-dimensional case.
According to 25.4, any characteristic $g(x_1, \ldots, x_n)$ of an actual sample
may be regarded as an observed value of a random variable $g(x_1, \ldots, x_n)$,
where $x_1, \ldots, x_n$ are independent variables, all having the same distribution
as the original variable $\xi$. The distribution of the random
variable $g(x_1, \ldots, x_n)$ is called the sampling distribution of the characteristic
$g(x_1, \ldots, x_n)$. Thus we may talk of the sampling distribution
of the mean $\bar x$, of the variance $s^2$, etc.

The same remarks apply to samples in any number of dimensions. 
Any sample characteristic may be regarded as an observed value of 
a certain random variable, the distribution of which is called the 
sampling distribution of the characteristic. Thus we may talk of the 
sampling distribution of the correlation coefficient r, of the correlation 
determinant $R$, etc.

For any sample characteristic $g$, we may thus consider its sampling
distribution, and calculate the moments, semi-invariants etc. of this
distribution. As usual (cf 15.3 and 15.6) we employ in such cases the
symbols $E(g)$ and $D(g)$ to denote the mean and the s. d. of the random
variable $g = g(x_1, \ldots, x_n)$. Further, when we are concerned with
some characteristic of the $g$ distribution (such as a central moment, a
semi-invariant etc.), which has been given a standard notation (such as
$\mu_\nu$ or $\kappa_\nu$) in Ch. 15, we shall sometimes use the standard symbol of
this characteristic, followed by the corresponding random variable
within brackets. Thus we shall write e. g. for the central moment of
order $\nu$ of the sample characteristic $g = g(x_1, \ldots, x_n)$

$$\mu_\nu(g) = E\big(g - E(g)\big)^\nu.$$

Similarly, when two sample characteristics $f(x_1, \ldots, x_n)$ and $g(x_1, \ldots, x_n)$
are considered simultaneously, the correlation coefficient of their joint
sampling distribution will be denoted by

$$\varrho(f, g) = \frac{E\big[(f - E(f))(g - E(g))\big]}{D(f)\,D(g)}.$$

Whenever we are concerned with sampling distributions connected
with a given population, it should always be borne in mind that the
sample characteristics ($\bar x$, $s$, $m_\nu$, $k_\nu$, $r$ etc.) are conceived as random variables,
while the population characteristics ($m$, $\sigma$, $\mu_\nu$, $\kappa_\nu$, $\varrho$ etc.) are fixed
(though sometimes unknown) constants.

27.2. The sample mean $\bar x$. — Consider a one-dimensional sample
with the values $x_1, \ldots, x_n$. Regarding the $x_i$ as independent random
variables, each having the d. f. $F(x)$, we obtain

$$E(\bar x) = \frac{1}{n}\sum_i E(x_i) = m, \qquad
D^2(\bar x) = \frac{1}{n^2}\sum_i D^2(x_i) = \frac{\mu_2}{n} = \frac{\sigma^2}{n}. \tag{27.2.1}$$

Thus the random variable $\bar x = \frac{1}{n}\sum x_i$ has the mean $m$ and the variance
$\mu_2/n$, i. e. the s. d. $\sigma/\sqrt{n}$. It then immediately follows from Tchebycheff's
theorem 20.4 that the sample mean $\bar x$ converges in probability
to the population mean $m$, as $n$ tends to infinity. 1)

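Writing $\bar x - m = \frac{1}{n}\sum_i (x_i - m)$, and bearing in mind that the $x_i$ are
independent, and that any difference $x_i - m$ has the mean value zero,
we further obtain

$$\begin{aligned}
\mu_3(\bar x) &= E(\bar x - m)^3 = \frac{1}{n^3}\sum_i E(x_i - m)^3 = \frac{\mu_3}{n^2},\\
\mu_4(\bar x) &= E(\bar x - m)^4 = \frac{1}{n^4}\sum_i E(x_i - m)^4
  + \frac{6}{n^4}\sum_{i<j} E\big((x_i - m)^2 (x_j - m)^2\big)\\
 &= \frac{\mu_4}{n^3} + \frac{3(n-1)}{n^3}\,\mu_2^2
  = \frac{3\mu_2^2}{n^2} + \frac{\mu_4 - 3\mu_2^2}{n^3}.
\end{aligned} \tag{27.2.2}$$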
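
The relations (27.2.1) and (27.2.2) are exact identities, so they can be checked without any simulation by enumerating every possible sample from a small discrete population. The following Python sketch is an added illustration under that assumption (a population taking finitely many values with equal probabilities); the helper names `central_moments` and `mean_sampling_moments` are hypothetical.

```python
from fractions import Fraction
from itertools import product


def central_moments(values, orders):
    """Exact mean and central moments of the uniform distribution on `values`."""
    vals = [Fraction(v) for v in values]
    mean = sum(vals) / len(vals)
    return mean, {k: sum((v - mean) ** k for v in vals) / len(vals) for k in orders}


def mean_sampling_moments(population, n, orders=(2, 3, 4)):
    """Exact central moments of the sample mean for samples of size n.

    Every one of the len(population)**n ordered samples is equally likely,
    so the sampling distribution of the mean is the uniform distribution
    on the list of all sample means (repetitions included).
    """
    means = [Fraction(sum(s), n) for s in product(population, repeat=n)]
    return central_moments(means, orders)


population = (0, 1, 3)   # an arbitrary, asymmetric toy population
n = 3
m, mu = central_moments(population, (2, 3, 4))
mean_of_xbar, mu_xbar = mean_sampling_moments(population, n)

assert mean_of_xbar == m                       # E(xbar) = m
assert mu_xbar[2] == mu[2] / n                 # D^2(xbar) = mu_2 / n
assert mu_xbar[3] == mu[3] / n ** 2            # mu_3(xbar) = mu_3 / n^2
assert mu_xbar[4] == 3 * mu[2] ** 2 / n ** 2 + (mu[4] - 3 * mu[2] ** 2) / n ** 3
```

Rational arithmetic is used so that the comparisons are exact; with floating-point numbers the same identities would hold only up to rounding.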


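The higher central moments of $\bar x$ may be found by similar, though
somewhat more tedious, calculations. Thus we find

$$\mu_5(\bar x) = E(\bar x - m)^5 = \frac{10\,\mu_2\mu_3}{n^3} + O\!\left(\frac{1}{n^4}\right),$$

$$\mu_6(\bar x) = E(\bar x - m)^6 = \frac{15\,\mu_2^3}{n^3} + O\!\left(\frac{1}{n^4}\right),$$

and generally

$$E(\bar x - m)^{2k-1} = O\!\left(\frac{1}{n^k}\right), \qquad
E(\bar x - m)^{2k} = O\!\left(\frac{1}{n^k}\right). \tag{27.2.3}$$

1) By the less elementary Khintchine's theorem 20.5, it follows that the convergence
in probability of $\bar x$ to $m$ holds as soon as the population mean $m$ exists, even when
$\mu_2$ is not finite.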

In the important particular case when the distribution of the population
is normal $(m, \sigma)$, it has been pointed out in 17.3 that $\bar x$ is also
normal, with mean $m$ and s. d. $\sigma/\sqrt{n}$. It follows that in this case any
$\mu_\nu(\bar x)$ of odd order is zero, while the three first central moments of
even order reduce to

$$\mu_2(\bar x) = D^2(\bar x) = \frac{\sigma^2}{n}, \qquad
\mu_4(\bar x) = \frac{3\sigma^4}{n^2}, \qquad
\mu_6(\bar x) = \frac{15\sigma^6}{n^3}.$$


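27.3. The moments $a_\nu$. — For any sample moment $a_\nu = \frac{1}{n}\sum_i x_i^\nu$ we
obtain, in direct generalization of (27.2.1) and (27.2.3),

$$\begin{aligned}
E(a_\nu) &= \frac{1}{n}\sum_i E(x_i^\nu) = \alpha_\nu, \qquad
D^2(a_\nu) = \frac{\alpha_{2\nu} - \alpha_\nu^2}{n},\\
E(a_\nu - \alpha_\nu)^{2k-1} &= O\!\left(\frac{1}{n^k}\right), \qquad
E(a_\nu - \alpha_\nu)^{2k} = O\!\left(\frac{1}{n^k}\right).
\end{aligned} \tag{27.3.1}$$

By Khintchine's theorem 20.5 it follows from the first of these relations
that, as soon as the population moment $\alpha_\nu$ exists, the sample
moment $a_\nu$ converges in probability to $\alpha_\nu$, as $n \to \infty$.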

It now follows from the corollary to theorem 20.6 that any rational
function, or power of a rational function, of the sample moments $a_\nu$ converges
in probability to the constant obtained by substituting throughout
$\alpha_\nu$ for $a_\nu$, provided that all the $\alpha_\nu$ occurring in the resulting expression
exist, and that the constant thus obtained is finite.

Hence in particular the central moments $m_\nu$, the semi-invariants $k_\nu$
and the coefficients $g_1$ and $g_2$ defined by (27.1.3) all converge in probability
to the corresponding population characteristics, as $n \to \infty$. In
large samples, any of these sample characteristics may thus be regarded
as an estimate of the corresponding population characteristic.
We shall, however, later find that the estimates obtained in this way
are not always the best that we can obtain (cf 27.6 and 33.1).

Any mean value of the type

$$E\big(a_\nu^{\,p}\, a_\varrho^{\,q} \cdots\big), \tag{27.3.2}$$

where $p, q, \ldots$ are integers, can be obtained by straightforward, though
often tedious, algebraical calculation. We have only to use the fact
that the $x_i$ are independent variables such that $E(x_i^\nu) = \alpha_\nu$. — In the
particular case when the population mean $m$ is equal to zero, $\alpha_\nu$ coincides
with the central moment $\mu_\nu$. If the sample mean $a_1 = \bar x$ occurs
among the factors in (27.3.2), the calculations are in this case simplified,
since any term containing one of the $x_i$ in the first degree has
then the mean value zero.


27.4. The variance $m_2$. — Any central sample moment
$m_\nu = \frac{1}{n}\sum_i (x_i - \bar x)^\nu$ is independent of the position of the origin on the
scale of the variable. Placing the origin in the mean of the population,
we have $m = 0$. When we are concerned with the sampling
distributions of the $m_\nu$, we may thus always suppose $m = 0$, and so
introduce the simplification mentioned at the end of the preceding
paragraph. The formulae thus obtained will hold true irrespective of
the value of $m$.

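We accordingly suppose $m = 0$, and consider the sample variance
$m_2 = s^2 = \frac{1}{n}\sum_i (x_i - \bar x)^2 = a_2 - \bar x^2$. By (27.2.1) and (27.3.1) we have,
since $m = 0$,

$$E(m_2) = E(a_2) - E(\bar x^2) = \mu_2 - \frac{\mu_2}{n} = \frac{n-1}{n}\,\mu_2. \tag{27.4.1}$$

We further have $m_2^2 = a_2^2 - 2\bar x^2 a_2 + \bar x^4$. Assuming always $m = 0$, we
find

$$\begin{aligned}
E(a_2^2) &= \frac{1}{n^2}\,E\Big(\sum_i x_i^2\Big)^2 = \frac{\mu_4 + (n-1)\mu_2^2}{n},\\
E(\bar x^2 a_2) &= \frac{1}{n^3}\,E\Big[\Big(\sum_i x_i\Big)^2 \sum_i x_i^2\Big] = \frac{\mu_4 + (n-1)\mu_2^2}{n^2},\\
E(\bar x^4) &= \frac{1}{n^4}\,E\Big(\sum_i x_i\Big)^4 = \frac{\mu_4 + 3(n-1)\mu_2^2}{n^3},
\end{aligned}$$

and hence after reduction

$$E(m_2^2) = \mu_2^2 + \frac{\mu_4 - 3\mu_2^2}{n} - \frac{2\mu_4 - 5\mu_2^2}{n^2} + \frac{\mu_4 - 3\mu_2^2}{n^3},$$

$$D^2(m_2) = E(m_2^2) - E^2(m_2)
 = \frac{\mu_4 - \mu_2^2}{n} - \frac{2(\mu_4 - 2\mu_2^2)}{n^2} + \frac{\mu_4 - 3\mu_2^2}{n^3}. \tag{27.4.2}$$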
The higher central moments of $m_2$ may be obtained in the same way.
The calculations are long and uninteresting, but no difficulty of principle
is involved. We give only the leading terms of the third and
fourth moments:

$$\begin{aligned}
\mu_3(m_2) &= E\Big(m_2 - \frac{n-1}{n}\mu_2\Big)^3
 = \frac{\mu_6 - 3\mu_2\mu_4 - 6\mu_3^2 + 2\mu_2^3}{n^2} + O\!\left(\frac{1}{n^3}\right),\\
\mu_4(m_2) &= E\Big(m_2 - \frac{n-1}{n}\mu_2\Big)^4
 = \frac{3(\mu_4 - \mu_2^2)^2}{n^2} + O\!\left(\frac{1}{n^3}\right).
\end{aligned} \tag{27.4.3}$$

We shall finally consider the covariance (cf 21.2) between the mean
$\bar x$ and the variance $m_2$ of the sample. For an arbitrary value of $m$,
this is

$$\mu_{11}(\bar x, m_2) = E\Big((\bar x - m)\Big(m_2 - \frac{n-1}{n}\mu_2\Big)\Big) = E\big((\bar x - m)\,m_2\big).$$

Since the last expression is clearly independent of the position of the
origin, we may again assume $m = 0$, and thus obtain by calculations
of the same kind as above

$$\mu_{11}(\bar x, m_2) = E(\bar x\, m_2) = E(\bar x\, a_2) - E(\bar x^3)
 = \frac{\mu_3}{n} - \frac{\mu_3}{n^2} = \frac{n-1}{n^2}\,\mu_3. \tag{27.4.4}$$
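
Since (27.4.1), (27.4.2) and (27.4.4) are exact for every $n$, they lend themselves to a direct check by complete enumeration. The Python sketch below is an added illustration; the toy population and the helper names `population_central_moment`, `expectation`, `xbar` and `m2` are assumptions made for the example only.

```python
from fractions import Fraction
from itertools import product


def population_central_moment(population, k):
    """Central moment mu_k of the uniform distribution on `population`."""
    vals = [Fraction(v) for v in population]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** k for v in vals) / len(vals)


def expectation(population, n, statistic):
    """Exact mean of `statistic` over all equally likely ordered samples of size n."""
    samples = list(product(population, repeat=n))
    return sum(statistic(s) for s in samples) / len(samples)


def xbar(sample):
    return Fraction(sum(sample), len(sample))


def m2(sample):
    """Sample variance with divisor n, as in (27.1.1)."""
    centre = xbar(sample)
    return sum((x - centre) ** 2 for x in sample) / len(sample)


population, n = (0, 1, 3), 3          # the population mean is not zero here
mu2, mu3, mu4 = (population_central_moment(population, k) for k in (2, 3, 4))

mean_m2 = expectation(population, n, m2)
var_m2 = expectation(population, n, lambda s: m2(s) ** 2) - mean_m2 ** 2
cov = (expectation(population, n, lambda s: xbar(s) * m2(s))
       - expectation(population, n, xbar) * mean_m2)

assert mean_m2 == Fraction(n - 1, n) * mu2                                   # (27.4.1)
assert var_m2 == ((mu4 - mu2 ** 2) / n - 2 * (mu4 - 2 * mu2 ** 2) / n ** 2
                  + (mu4 - 3 * mu2 ** 2) / n ** 3)                           # (27.4.2)
assert cov == Fraction(n - 1, n ** 2) * mu3                                  # (27.4.4)
```

The population chosen has a mean different from zero, so the check also illustrates that the formulae, although derived under the assumption $m = 0$, hold irrespective of the value of $m$.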





For any symmetric distribution, we have $\mu_3 = 0$, and thus $\bar x$ and $m_2$
are uncorrelated. We shall see later (cf 29.3) that, in the particular
case of a normal population, $\bar x$ and $m_2$ are not only uncorrelated, but
even independent. For a normal population, (27.4.1) and (27.4.2) give

$$E(m_2) = \frac{n-1}{n}\,\sigma^2, \qquad D^2(m_2) = \frac{2(n-1)}{n^2}\,\sigma^4. \tag{27.4.5}$$


27.5. Higher central moments and semi-invariants. — The expressions
for the characteristics of the sampling distributions of $m_\nu$
and $k_\nu$ are of rapidly increasing complexity when $\nu$ becomes greater
than 2, and we shall only mention a few comparatively simple cases,
omitting details of calculation. For further information, the reader
may be referred e. g. to papers by Tschuprow (Ref. 227) and Craig
(Ref. 67).


By calculations of the same kind as in the preceding paragraphs,
we obtain the expressions

$$\begin{aligned}
E(m_3) &= \frac{(n-1)(n-2)}{n^2}\,\mu_3,\\
E(m_4) &= \frac{(n-1)(n^2 - 3n + 3)}{n^3}\,\mu_4 + \frac{3(n-1)(2n-3)}{n^3}\,\mu_2^2.
\end{aligned} \tag{27.5.1}$$

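For any $m_\nu$ we have

$$m_\nu = \frac{1}{n}\sum_i (x_i - \bar x)^\nu
 = a_\nu - \binom{\nu}{1}\bar x\, a_{\nu-1} + \binom{\nu}{2}\bar x^2 a_{\nu-2} - \cdots. \tag{27.5.2}$$

As before, we may suppose $m = 0$, so that $E(a_\nu) = \mu_\nu$ and
$E(\bar x\, a_{\nu-1}) = \mu_\nu / n$.
For $1 < i \le \nu$, we have by (27.2.3) and (27.3.1), using the Schwarz
inequality (9.5.1),

$$E^2(\bar x^{\,i} a_{\nu-i}) \le E(\bar x^{\,2i})\,E(a_{\nu-i}^2) = O\!\left(\frac{1}{n^i}\right),$$

so that $E(\bar x^{\,i} a_{\nu-i}) = O(n^{-i/2})$, and (27.5.2) gives

$$E(m_\nu) = \mu_\nu + O\!\left(\frac{1}{n}\right). \tag{27.5.3}$$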

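Further, by (27.5.2) any power of $m_\nu - \mu_\nu$ is composed of terms of the
form $(a_\nu - \mu_\nu)^i\, \bar x^{\,j}\, a_\varrho^{\,k} \cdots$, and it is shown in the same way as above that
the mean value of such a term is of the order $n^{-(i+j)/2}$. Thus in order
to calculate the leading term of $E(m_\nu - \mu_\nu)^k$, it is sufficient to retain
the terms

$$m_\nu - \mu_\nu = a_\nu - \mu_\nu - \nu\,\bar x\, a_{\nu-1} + \cdots,$$

while all the following terms of (27.5.2) give a contribution of lower
order. For $k = 2$ we obtain in this way, since by (27.5.3) the difference
$E(m_\nu) - \mu_\nu$ is of order $n^{-1}$,

$$D^2(m_\nu) = \frac{\mu_{2\nu} - 2\nu\,\mu_{\nu-1}\mu_{\nu+1} - \mu_\nu^2 + \nu^2 \mu_2\,\mu_{\nu-1}^2}{n}
 + O\!\left(\frac{1}{n^2}\right). \tag{27.5.4}$$

Generally we obtain for any even power of $m_\nu - \mu_\nu$

$$E(m_\nu - \mu_\nu)^{2k} = O\!\left(\frac{1}{n^k}\right). \tag{27.5.5}$$

The mean value of a product $(m_\nu - \mu_\nu)(m_\varrho - \mu_\varrho)$ may be calculated in
the same way, and we thus obtain, using again (27.5.3), the following
expression for the covariance between $m_\nu$ and $m_\varrho$:

$$\mu_{11}(m_\nu, m_\varrho) =
 \frac{\mu_{\nu+\varrho} - \nu\,\mu_{\nu-1}\mu_{\varrho+1} - \varrho\,\mu_{\nu+1}\mu_{\varrho-1}
 - \mu_\nu\mu_\varrho + \nu\varrho\,\mu_2\,\mu_{\nu-1}\mu_{\varrho-1}}{n}
 + O\!\left(\frac{1}{n^2}\right). \tag{27.5.6}$$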

The expressions of the first semi-invariants $k_\nu$ of the sample are
obtained by substituting in (15.10.5) the sample moments $m_\nu$ for the
population moments $\mu_\nu$. We obtain

$$k_1 = \bar x, \qquad k_2 = m_2, \qquad k_3 = m_3, \qquad k_4 = m_4 - 3m_2^2.$$

We may then deduce expressions for the means and variances of the
$k_\nu$ by means of the formulae for the $m_\nu$ given above. In particular we
obtain in this way, expressing $E(k_\nu)$ in terms of the population
semi-invariants $\kappa_\nu$,

$$\begin{aligned}
E(k_1) &= \kappa_1,\\
E(k_2) &= \frac{n-1}{n}\,\kappa_2,\\
E(k_3) &= \frac{(n-1)(n-2)}{n^2}\,\kappa_3,\\
E(k_4) &= \frac{(n-1)(n^2 - 6n + 6)}{n^3}\,\kappa_4 - \frac{6(n-1)}{n^2}\,\kappa_2^2.
\end{aligned} \tag{27.5.7}$$


27.6. Unbiased estimates. — Consider the sample variance
$m_2 = \frac{1}{n}\sum_i (x_i - \bar x)^2$. According to 27.3, $m_2$ converges in probability to the
population variance $\mu_2$ as $n \to \infty$, and for large values of $n$ we may
thus use $m_2$ as an estimate of $\mu_2$. In the terminology introduced by
R. A. Fisher (Ref. 89, 96), an estimate which converges in probability
to the estimated value, as the size of the sample tends to infinity,
is called a consistent estimate. Thus $m_2$ is a consistent estimate of $\mu_2$.

On the other hand, it is shown by (27.4.1) that the mean value
of $m_2$ is not $\mu_2$ but $\frac{n-1}{n}\mu_2$. Thus if we repeatedly draw samples of
a fixed size $n$ from the given population, and calculate the variance
$m_2$ for each sample, the arithmetic mean of all the observed $m_2$ values
will not converge in probability to the »true value» $\mu_2$, but to the
smaller value $\frac{n-1}{n}\mu_2$. As an estimate of $\mu_2$, the quantity $m_2$ is
thus affected with a certain negative bias, which may be removed if
we replace $m_2$ by the quantity

$$M_2 = \frac{n}{n-1}\,m_2 = \frac{1}{n-1}\sum_i (x_i - \bar x)^2.$$

We have, in fact, $E(M_2) = \frac{n}{n-1}E(m_2) = \mu_2$, and accordingly $M_2$ is
called an unbiased estimate of $\mu_2$. Since the factor $\frac{n}{n-1}$ tends to unity
as $n \to \infty$, both $M_2$ and $m_2$ converge in probability to $\mu_2$, so that $M_2$
is consistent as well as unbiased, while $m_2$ is consistent, but not unbiased.

Similarly, by 27.3, any central moment $m_\nu$ or semi-invariant $k_\nu$ of
the sample is a consistent estimate of the corresponding $\mu_\nu$ or $\kappa_\nu$,
but it follows from (27.5.1) and (27.5.7) that for $\nu > 1$ these estimates
are not unbiased. As in the case of $m_2$ we may, however, by simple
corrections form estimates which are both consistent and unbiased.
Thus we obtain for $\nu = 2$, 3 and 4 the following corrected estimates
of $\mu_\nu$ and $\kappa_\nu$:

$$\begin{aligned}
M_2 &= \frac{n}{n-1}\,m_2,\\
M_3 &= \frac{n^2}{(n-1)(n-2)}\,m_3,\\
M_4 &= \frac{n(n^2 - 2n + 3)}{(n-1)(n-2)(n-3)}\,m_4 - \frac{3n(2n-3)}{(n-1)(n-2)(n-3)}\,m_2^2,
\end{aligned}$$

and

$$\begin{aligned}
K_2 &= \frac{n}{n-1}\,m_2,\\
K_3 &= \frac{n^2}{(n-1)(n-2)}\,m_3,\\
K_4 &= \frac{n^2}{(n-1)(n-2)(n-3)}\,\big[(n+1)\,m_4 - 3(n-1)\,m_2^2\big].
\end{aligned}$$

By means of the formulae given in the two preceding paragraphs, it
is easily verified that in all these cases we have $E(M_\nu) = \mu_\nu$ and
$E(K_\nu) = \kappa_\nu$. For large values of $n$, it is often indifferent whether we
use $M_\nu$ and $K_\nu$, or $m_\nu$ and $k_\nu$, but for small $n$ the bias involved in
the latter quantities may be considerable. — We shall return to
questions connected with the properties of estimates in Ch. 32.
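
The unbiasedness of the corrected estimates can be verified exactly on a small example by averaging them over every possible sample. The following Python sketch is an added illustration; the function names `central` and `unbiased_estimates`, and the three-point toy population, are assumptions introduced for the example. The sample size must exceed 3, since the factor $n - 3$ occurs in the denominators of $M_4$ and $K_4$.

```python
from fractions import Fraction
from itertools import product


def central(sample, k):
    """Central sample moment m_k with divisor n."""
    n = len(sample)
    centre = Fraction(sum(sample), n)
    return sum((x - centre) ** k for x in sample) / n


def unbiased_estimates(sample):
    """Corrected estimates M_2, M_3, M_4 and K_2, K_3, K_4 of 27.6."""
    n = len(sample)
    if n < 4:
        raise ValueError("M_4 and K_4 require a sample of at least 4 values")
    m2, m3, m4 = central(sample, 2), central(sample, 3), central(sample, 4)
    d = (n - 1) * (n - 2) * (n - 3)
    M2 = Fraction(n, n - 1) * m2
    M3 = Fraction(n * n, (n - 1) * (n - 2)) * m3
    M4 = Fraction(n * (n * n - 2 * n + 3), d) * m4 - Fraction(3 * n * (2 * n - 3), d) * m2 ** 2
    K4 = Fraction(n * n, d) * ((n + 1) * m4 - 3 * (n - 1) * m2 ** 2)
    # K_2 and K_3 coincide with M_2 and M_3.
    return {"M2": M2, "M3": M3, "M4": M4, "K2": M2, "K3": M3, "K4": K4}


population, n = (0, 1, 3), 4
samples = list(product(population, repeat=n))      # 3**4 equally likely samples
average = {key: sum(unbiased_estimates(s)[key] for s in samples) / len(samples)
           for key in ("M2", "M3", "M4", "K4")}

mu = {k: central(population, k) for k in (2, 3, 4)}  # population central moments
assert average["M2"] == mu[2]
assert average["M3"] == mu[3]
assert average["M4"] == mu[4]
assert average["K4"] == mu[4] - 3 * mu[2] ** 2       # kappa_4 of the population
```

Applying `central` to the tuple `population` gives the population moments only because the toy population is uniform on its listed values; for a general population the moments would have to be supplied separately.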

We have seen in the preceding paragraphs that the algebraical
process of working out formulae for the sampling characteristics of
the quantities $m_\nu$ and $k_\nu$ becomes very laborious, as soon as we leave
the simplest cases. It has been discovered by R. A. Fisher (Ref. 99),
who has introduced the quantities $K_\nu$ (which he denotes by $k_\nu$, cf footnote
p. 342), that the corresponding calculations for the $K_\nu$ may be
considerably simplified by means of combinatorial methods. These
methods have been further developed by Fisher himself, Wishart and
others. A good account of the subject has been given by Kendall
(Ref. 19), who gives numerous references to the literature.

27.7. Functions of moments. — It often occurs that the mean and
the variance of some function of the sample moments are required.
When the function is a polynomial in $\bar x$ and the central moments $m_\nu$,
the problem can be solved by the method developed in 27.3-27.5.
Even when fractional powers are involved, we may often use a similar
direct method. Consider e. g. the simple example of the standard
deviation $s = \sqrt{m_2}$ of the sample. We have identically

$$\sqrt{m_2} - \sqrt{\mu_2} = \frac{m_2 - \mu_2}{2\sqrt{\mu_2}}
 - \frac{(m_2 - \mu_2)^2}{2\sqrt{\mu_2}\,\big(\sqrt{m_2} + \sqrt{\mu_2}\big)^2}.$$

By (27.4.1), the first term in the second member has a mean value of
order $n^{-1}$. The last term is smaller in absolute value than
$\dfrac{(m_2 - \mu_2)^2}{2\mu_2^{3/2}}$,
and thus by (27.4.2) and (27.4.1) its mean value is also of order $n^{-1}$.
Thus we obtain

$$E(\sqrt{m_2}) = \sqrt{\mu_2} + O\!\left(\frac{1}{n}\right). \tag{27.7.1}$$

By a similar calculation we obtain

$$D^2(\sqrt{m_2}) = \frac{\mu_4 - \mu_2^2}{4\mu_2\, n} + O\!\left(\frac{1}{n^2}\right). \tag{27.7.2}$$

In many cases, however, we are concerned with functions involving
ratios between powers of certain moments, such as the coefficients $g_1$
and $g_2$, the coefficient of correlation etc. We shall give a theorem
that covers the most important of these cases. The theorem will be
stated and proved for the case of a function $H(m_\nu, m_\varrho)$ of two central
moments $m_\nu$ and $m_\varrho$, but is immediately extended to any number of
arguments, including also the mean $\bar x$. The case of a function of one
single argument is, of course, included as the particular case when
the function is independent of one of the two arguments. The theorem
also holds, with the same proof, for functions of moments of
multi-dimensional samples (cf 27.8).

Consider a function $H(m_\nu, m_\varrho)$ which does not contain $n$ explicitly.
We may regard $H$ either as a function of the two arguments $m_\nu$ and
$m_\varrho$ or, replacing $m_\nu$ and $m_\varrho$ by their expressions in terms of the sample
values, as a function of the $n$ variables $x_1, \ldots, x_n$. In the latter case
the function may, of course, contain $n$ explicitly. — We shall now
prove the following theorem:

Suppose that the two following conditions are satisfied:

1) In some neighbourhood of the point $m_\nu = \mu_\nu$, $m_\varrho = \mu_\varrho$, the function
$H$ is continuous and has continuous derivatives of the first and second
order with respect to the arguments $m_\nu$ and $m_\varrho$.

2) For all possible values of the $x_i$, we have $|H| < C n^p$, where $C$ and
$p$ are non-negative constants.

Denoting by $H_0$, $H_1$ and $H_2$ the values assumed by the function
$H(m_\nu, m_\varrho)$ and its first order partial derivatives in the point $m_\nu = \mu_\nu$,
$m_\varrho = \mu_\varrho$, the mean and the variance of the random variable $H(m_\nu, m_\varrho)$
are then given by

$$\begin{aligned}
E(H) &= H_0 + O\!\left(\frac{1}{n}\right),\\
D^2(H) &= \mu_2(m_\nu)\,H_1^2 + 2\mu_{11}(m_\nu, m_\varrho)\,H_1 H_2 + \mu_2(m_\varrho)\,H_2^2 + O\!\left(n^{-3/2}\right).
\end{aligned} \tag{27.7.3}$$


By (27.5.4) and (27.5.6), the variance of $H$ is thus of the form $c/n + O(n^{-3/2})$,
where $c$ is constant. — The proofs of these relations found in the literature are often
unsatisfactory. The condition 2) as given above may be considerably generalized, but
some condition of this type is necessary for the truth of the theorem. In fact, if we
altogether omit condition 2), it would e. g. follow that, for any population with
$\mu_2 > 0$, the function $1/m_2$ would have a mean value of the form $1/\mu_2 + O(n^{-1})$. This
is, however, evidently false. The mean of $1/m_2$ cannot be finite for any population
with a distribution of the discrete type, since we have then a positive probability
that $m_2 = 0$. It is easy to show that similar contradictions may arise even for
continuous distributions.

In 28.4, it will be proved that the function $H(m_\nu, m_\varrho)$ is asymptotically
normally distributed for large values of $n$. It is interesting to observe that, in this
proof, no condition corresponding to the present condition 2) will be required.

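Let $P(S)$ denote the pr. f. of the joint distribution of $x_1, x_2, \ldots, x_n$.
$P(S)$ is a set function in the space $R_n$ of the $x_i$. If, in Tchebycheff's
theorem (15.7.1), we take $g(\xi) = (m_\nu - \mu_\nu)^{2k}$, it follows from (27.5.5)
that we have for any $\varepsilon > 0$

$$P\big[(m_\nu - \mu_\nu)^{2k} \ge \varepsilon^{2k}\big] < \frac{A}{\varepsilon^{2k} n^k},$$

or

$$P\big[\,|m_\nu - \mu_\nu| \ge \varepsilon\,\big] < \frac{A}{\varepsilon^{2k} n^k},$$

where $A$ is a constant independent of $n$ and $\varepsilon$. The corresponding
result holds, of course, for $m_\varrho$. Denote by $Z$ the set of all points in
$R_n$ such that the inequalities $|m_\nu - \mu_\nu| < \varepsilon$ and $|m_\varrho - \mu_\varrho| < \varepsilon$ are both
satisfied, while $Z^*$ is the complementary set. We then have, according
to the above,

$$P(Z) > 1 - \frac{2A}{\varepsilon^{2k} n^k}, \qquad P(Z^*) < \frac{2A}{\varepsilon^{2k} n^k}. \tag{27.7.4}$$

Now

$$E(H) = \int_Z H\,dP + \int_{Z^*} H\,dP,$$

and by condition 2) the modulus of the last integral is smaller than
$\dfrac{2AC}{\varepsilon^{2k} n^{k-p}}$. Choosing $k > p + 1$, it follows that

$$E(H) = \int_Z H\,dP + O\!\left(\frac{1}{n}\right). \tag{27.7.5}$$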


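If $\varepsilon$ is sufficiently small, we have by condition 1) for any point in
the set $Z$

$$\begin{aligned}
H(m_\nu, m_\varrho) &= H_0 + H_1 (m_\nu - \mu_\nu) + H_2 (m_\varrho - \mu_\varrho) + R,\\
R &= \tfrac{1}{2}\big[H'_{11}(m_\nu - \mu_\nu)^2 + 2H'_{12}(m_\nu - \mu_\nu)(m_\varrho - \mu_\varrho) + H'_{22}(m_\varrho - \mu_\varrho)^2\big],
\end{aligned} \tag{27.7.6}$$

where the $H'_{ij}$ denote the values of the second order derivatives in
some intermediate point between $(\mu_\nu, \mu_\varrho)$ and $(m_\nu, m_\varrho)$. Hence

$$\int_Z H\,dP = H_0\,P(Z) + H_1 \int_Z (m_\nu - \mu_\nu)\,dP + H_2 \int_Z (m_\varrho - \mu_\varrho)\,dP + \int_Z R\,dP. \tag{27.7.7}$$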

z / 

Consider now the terms in the second member of the last relation.
By (27.7.4), the first term differs from $H_0$ by a quantity of order $n^{-k}$,
which is smaller than $n^{-1}$, since $k > p + 1 \ge 1$. The two following
terms are at most of order $n^{-1}$, since $H_1$ and $H_2$ are independent of
$n$, and we have by (27.5.3) and (27.5.5), using the Schwarz inequality
(9.5.1),

$$\int_Z (m_\nu - \mu_\nu)\,dP = E(m_\nu - \mu_\nu) - \int_{Z^*} (m_\nu - \mu_\nu)\,dP
 = O\!\left(\frac{1}{n}\right) - \int_{Z^*} (m_\nu - \mu_\nu)\,dP,$$

$$\left|\int_{Z^*} (m_\nu - \mu_\nu)\,dP\right|
 \le \left[\int_{Z^*} (m_\nu - \mu_\nu)^2\,dP \cdot \int_{Z^*} dP\right]^{1/2}
 \le \big[E(m_\nu - \mu_\nu)^2\,P(Z^*)\big]^{1/2} = O\!\left(n^{-(k+1)/2}\right),$$

and similarly for the term containing $m_\varrho$. Finally, by condition 1) the
derivatives $H'_{ij}$ are bounded for all sufficiently small $\varepsilon$, and it then
follows in the same way that the last term in (27.7.7) is also of order
$n^{-1}$. Hence the first member of (27.7.7) differs from $H_0$ by a quantity
of order $n^{-1}$, and according to (27.7.5) we have thus proved the first
relation (27.7.3).

In order to prove also the second relation (27.7.3), we write

$$E(H - H_0)^2 = \int_Z (H - H_0)^2\,dP + \int_{Z^*} (H - H_0)^2\,dP.$$

Choosing now $k > 2p + \tfrac{3}{2}$, we obtain by means of condition 2) and
the first relation (27.7.3) just proved

$$D^2(H) = \int_Z (H - H_0)^2\,dP + O\!\left(n^{-3/2}\right).$$

We then express $(H - H_0)^2$ by means of the development (27.7.6), and
proceed in the same way as before. The calculations are quite similar
to those made above, except with respect to the terms of the type
$\int_Z (m_\nu - \mu_\nu)\,R\,dP$, where we have, e. g., using (15.4.6) and (27.5.5),

$$\left|\int_Z H'_{11}\,(m_\nu - \mu_\nu)^3\,dP\right| < K\,E\big(|m_\nu - \mu_\nu|^3\big)
 \le K\,\big(E(m_\nu - \mu_\nu)^4\big)^{3/4} = O\!\left(n^{-3/2}\right).$$

This completes the proof of the theorem.

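We shall now apply the relations (27.7.3) to some examples. Consider
first the coefficients of skewness and excess of the sample:

$$g_1 = \frac{m_3}{m_2^{3/2}}, \qquad g_2 = \frac{m_4}{m_2^2} - 3.$$

As soon as $\mu_2 > 0$, these functions satisfy condition 1). In order to
show that condition 2) is also satisfied, we write

$$g_1 = \sqrt{n}\,\frac{\sum_i (x_i - \bar x)^3}{\big[\sum_j (x_j - \bar x)^2\big]^{3/2}}
 = \sqrt{n}\,\sum_i \left[\frac{x_i - \bar x}{\sqrt{\sum_j (x_j - \bar x)^2}}\right]^3,$$

and hence infer

$$|g_1| \le \sqrt{n}\,\sum_i \left[\frac{(x_i - \bar x)^2}{\sum_j (x_j - \bar x)^2}\right]^{3/2}
 \le \sqrt{n}\,\sum_i \frac{(x_i - \bar x)^2}{\sum_j (x_j - \bar x)^2} = \sqrt{n}.$$

In a similar way it is shown that $|g_2| < n$ for all $n > 3$. Thus we
may apply (27.7.3) to find the means and the variances of $g_1$ and $g_2$.
From (27.5.4) and (27.5.6) we find, to the order of approximation given
by (27.7.3),

$$\begin{aligned}
E(g_1) &= \frac{\mu_3}{\mu_2^{3/2}} = \gamma_1, \qquad E(g_2) = \frac{\mu_4}{\mu_2^2} - 3 = \gamma_2,\\
D^2(g_1) &= \frac{4\mu_2^2\mu_6 - 12\mu_2\mu_3\mu_5 - 24\mu_2^3\mu_4 + 9\mu_3^2\mu_4 + 35\mu_2^2\mu_3^2 + 36\mu_2^5}{4\mu_2^5\, n},\\
D^2(g_2) &= \frac{\mu_2^2\mu_8 - 4\mu_2\mu_4\mu_6 - 8\mu_2^2\mu_3\mu_5 + 4\mu_4^3 - \mu_2^2\mu_4^2 + 16\mu_2\mu_3^2\mu_4 + 16\mu_2^3\mu_3^2}{\mu_2^6\, n}.
\end{aligned} \tag{27.7.8}$$

When the parent population is normal, these approximate expressions
reduce to

$$E(g_1) = E(g_2) = 0, \qquad D^2(g_1) = \frac{6}{n}, \qquad D^2(g_2) = \frac{24}{n}. \tag{27.7.9}$$

The exact expressions for the normal case will be given in (29.3.7).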
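
The reduction of the general variance expressions for $g_1$ and $g_2$ to the normal values $6/n$ and $24/n$ is a matter of simple substitution, which may be sketched as follows. The Python function below is an added illustration with a hypothetical name, `skewness_excess_variances`; its coefficients are those obtained by inserting (27.5.4) and (27.5.6) into the second relation (27.7.3), and it returns only the leading terms of order $1/n$.

```python
from fractions import Fraction


def skewness_excess_variances(mu, n):
    """Leading terms of D^2(g_1) and D^2(g_2) from population central moments.

    `mu` maps the orders 2, 3, 4, 5, 6 and 8 to the central moments mu_k.
    """
    m2, m3, m4, m5, m6, m8 = (Fraction(mu[k]) for k in (2, 3, 4, 5, 6, 8))
    var_g1 = (4 * m2**2 * m6 - 12 * m2 * m3 * m5 - 24 * m2**3 * m4
              + 9 * m3**2 * m4 + 35 * m2**2 * m3**2 + 36 * m2**5) / (4 * m2**5 * n)
    var_g2 = (m2**2 * m8 - 4 * m2 * m4 * m6 - 8 * m2**2 * m3 * m5 + 4 * m4**3
              - m2**2 * m4**2 + 16 * m2 * m3**2 * m4 + 16 * m2**3 * m3**2) / (m2**6 * n)
    return var_g1, var_g2


def normal_central_moments(sigma2):
    """Central moments of a normal distribution with variance sigma2:
    odd orders vanish, and mu_4, mu_6, mu_8 equal 3, 15, 105 times the
    corresponding power of sigma2."""
    s = Fraction(sigma2)
    return {2: s, 3: 0, 4: 3 * s**2, 5: 0, 6: 15 * s**3, 8: 105 * s**4}


n = 100
var_g1, var_g2 = skewness_excess_variances(normal_central_moments(2), n)
assert var_g1 == Fraction(6, n)
assert var_g2 == Fraction(24, n)
```

The result does not depend on the value chosen for the variance, as it must, since $g_1$ and $g_2$ are unaffected by a change of scale.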
As our next example we consider the ratio

$$V = \frac{s}{\bar x} = \frac{\sqrt{m_2}}{\bar x},$$

which is known as the coefficient of variation of the sample. When
the population distribution is such that the variable takes only positive
values, we have

$$0 \le V^2 = \frac{m_2}{\bar x^2} = \frac{n\sum_i x_i^2}{\big(\sum_i x_i\big)^2} - 1 < n,$$

so that we may apply (27.7.3), replacing, in accordance with the remark
made in connection with the theorem, $m_\varrho$ by $\bar x$. By (27.2.1),
(27.4.2) and (27.4.4) we then obtain, to the order of approximation
given by (27.7.3),

$$E(V) = \frac{\sigma}{m}, \qquad
D^2(V) = \frac{m^2(\mu_4 - \mu_2^2) - 4m\,\mu_2\mu_3 + 4\mu_2^3}{4m^4\mu_2\, n}. \tag{27.7.10}$$

A normal population does not satisfy the condition that the variable
takes only positive values, and it is easily seen that for such a population
$V$ is not bounded, so that condition 2) is not satisfied. We
may, however, consider a normal distribution truncated at $x = 0$ (cf
19.3), and when $\sigma/m$ is fairly small, the central moments of such a distribution
will be approximately equal to the corresponding moments
of a complete normal distribution. In this case, the approximate expression
for the variance of $V$ reduces to

$$D^2(V) = \frac{\sigma^2}{2m^2 n}\left(1 + \frac{2\sigma^2}{m^2}\right). \tag{27.7.11}$$


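27.8. Characteristics of multi-dimensional distributions. — The
formulae for sample characteristics deduced in 27.2-27.6, as well as
the theorem proved in 27.7, may be directly extended to the characteristics
of multi-dimensional samples. The calculations are quite similar
to those given above, and we shall here only quote some formulae
relating to the two-dimensional case. The definitions of the symbols
used below have been given in 27.1, and we assume throughout that
all the requisite moments are finite. — We have

$$\begin{aligned}
E(m_{ik}) &= \mu_{ik} + O\!\left(\frac{1}{n}\right),\\
E(m_{11}) &= \frac{n-1}{n}\,\mu_{11}, \qquad D^2(m_{11}) = \frac{\mu_{22} - \mu_{11}^2}{n} + O\!\left(\frac{1}{n^2}\right),\\
\mu_{11}(m_{20}, m_{02}) &= \frac{\mu_{22} - \mu_{20}\mu_{02}}{n} + O\!\left(\frac{1}{n^2}\right),\\
\mu_{11}(m_{11}, m_{20}) &= \frac{\mu_{31} - \mu_{11}\mu_{20}}{n} + O\!\left(\frac{1}{n^2}\right).
\end{aligned}$$

The sample correlation coefficient

$$r = \frac{m_{11}}{\sqrt{m_{20}\,m_{02}}}$$

obviously satisfies the conditions of the theorem of 27.7, since we
have $|r| \le 1$. Denoting by $\varrho$ the population value of the correlation
coefficient, we then obtain by means of the relations given above, to
the order of approximation given by (27.7.3),

$$\begin{aligned}
E(r) &= \varrho,\\
D^2(r) &= \frac{\varrho^2}{4n}\left(\frac{\mu_{40}}{\mu_{20}^2} + \frac{\mu_{04}}{\mu_{02}^2}
 + \frac{2\mu_{22}}{\mu_{20}\mu_{02}} + \frac{4\mu_{22}}{\mu_{11}^2}
 - \frac{4\mu_{31}}{\mu_{11}\mu_{20}} - \frac{4\mu_{13}}{\mu_{11}\mu_{02}}\right).
\end{aligned} \tag{27.8.1}$$

For a normal population, the expression for the variance reduces (cf
Ex. 3, p. 317) to the following expression, which is correct to the
order $n^{-3/2}$,

$$D^2(r) = \frac{(1 - \varrho^2)^2}{n}. \tag{27.8.2}$$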
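
The passage from the general expression for the variance of $r$ to the normal case can be followed by direct substitution. In the Python sketch below, which is an added illustration, the function names `correlation_variance` and `standard_normal_moments` are hypothetical, and the moments inserted are those of a two-dimensional normal distribution with unit variances and correlation coefficient $\varrho$, namely $\mu_{40} = \mu_{04} = 3$, $\mu_{31} = \mu_{13} = 3\varrho$ and $\mu_{22} = 1 + 2\varrho^2$. The general expression presupposes $\mu_{11} \ne 0$.

```python
from fractions import Fraction


def correlation_variance(mu, n):
    """Leading term of D^2(r), cf. (27.8.1).

    `mu` maps index pairs (i, k) to the population central moments mu_ik.
    """
    m20, m02, m11 = mu[(2, 0)], mu[(0, 2)], mu[(1, 1)]
    if m11 == 0:
        raise ValueError("the expression is written for mu_11 != 0")
    rho2 = m11 ** 2 / (m20 * m02)
    bracket = (mu[(4, 0)] / m20 ** 2 + mu[(0, 4)] / m02 ** 2
               + 2 * mu[(2, 2)] / (m20 * m02) + 4 * mu[(2, 2)] / m11 ** 2
               - 4 * mu[(3, 1)] / (m11 * m20) - 4 * mu[(1, 3)] / (m11 * m02))
    return rho2 * bracket / (4 * n)


def standard_normal_moments(rho):
    """Central moments of a bivariate normal distribution with unit variances."""
    rho = Fraction(rho)
    return {(2, 0): Fraction(1), (0, 2): Fraction(1), (1, 1): rho,
            (4, 0): Fraction(3), (0, 4): Fraction(3),
            (3, 1): 3 * rho, (1, 3): 3 * rho, (2, 2): 1 + 2 * rho ** 2}


rho, n = Fraction(1, 2), 50
assert correlation_variance(standard_normal_moments(rho), n) == (1 - rho ** 2) ** 2 / n
```

With $\varrho = 1/2$ the bracket equals $4\varrho^2 - 8 + 4/\varrho^2 = 9$, and the variance becomes $(9/16)/n$, in agreement with the normal formula.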

We finally observe that the theorem of 27.3 on the convergence in
probability of sample characteristics holds true without modification
in the multi-dimensional case. Thus e. g. $r$ converges in probability
to $\varrho$, while the partial correlation coefficient $r_{12\cdot 34\ldots k}$ of the sample
converges in probability to $\varrho_{12\cdot 34\ldots k}$, etc.

27.9. Corrections for grouping. — In practice samples are very
often grouped (cf 25.5). Suppose that we draw a sample of $n$ from a
one-dimensional distribution of the continuous type, with the fr. f.
$f(x)$, and let the sample values be grouped into intervals of length $h$,
with the mid-points $\xi_i = \xi_0 + ih$, where $i = 0, \pm 1, \pm 2, \ldots$. In such
cases it is usual to assume, in calculating the moments and other
sample characteristics, that all sample values belonging to a certain
interval fall in the mid-point of that interval. We are then in reality
sampling from a distribution of the discrete type, where the variable
may take any value $\xi_i = \xi_0 + ih$ with the probability

$$p_i = \int_{\xi_i - h/2}^{\xi_i + h/2} f(x)\,dx.$$

The moments etc. that we are estimating from our sample characteristics
according to the formulae previously given in this chapter, are
thus the moments of this »grouped distribution»:

$$\bar\alpha_\nu = \sum_i p_i\,\xi_i^\nu.$$

However, in many cases it is not these moments that we really want
to know, but the moments of the given continuous distribution:

$$\alpha_\nu = \int_{-\infty}^{\infty} x^\nu f(x)\,dx.$$

Consequently it becomes important to investigate the relations between
the two sets of moments. It will be shown that, subject to
certain conditions, approximate values of the moments $\alpha_\nu$ may be
obtained by applying certain corrections to the raw or grouped moments
$\bar\alpha_\nu$.

The raw moments may be written

$$\bar\alpha_\nu = \sum_{i=-\infty}^{\infty} \xi_i^\nu\,p_i = \sum_{i=-\infty}^{\infty} g(\xi_i),$$

where $\xi_i = \xi_0 + ih$, and

$$g(\xi) = \xi^\nu \int_{\xi - h/2}^{\xi + h/2} f(x)\,dx. \tag{27.9.1}$$

From the EulerMacLaurin sum formula (12.2.5) we then obtain, 
assuming f{x) continuous for all x, 


(27.9.2) 


00 + 

汰 v = f ( 运 o + h/) ,( h f f{^)dx -f R y 

OD 

R = —hf P 1 (!/)g'(^ 0 + hy)dy. 


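Let us assume for the moment that the remainder R may be neglected.
We then obtain, reverting the order of integration,

$$\bar{\alpha}_\nu = \int_{-\infty}^{\infty} f(x)\,dx \int_{\frac{x - \xi_0}{h} - \frac12}^{\frac{x - \xi_0}{h} + \frac12} (\xi_0 + hy)^{\nu}\,dy = \int_{-\infty}^{\infty} \frac{\left(x + \frac{h}{2}\right)^{\nu+1} - \left(x - \frac{h}{2}\right)^{\nu+1}}{h(\nu+1)}\, f(x)\,dx.$$

Thus the grouped moments may be expressed as linear functions
of the "true" moments $\alpha_\nu$. Solving the equations successively with
respect to the $\alpha_\nu$, we obtain

$$\begin{aligned}
\alpha_1 &= \bar{\alpha}_1, \\
\alpha_2 &= \bar{\alpha}_2 - \tfrac{1}{12} h^2, \\
\alpha_3 &= \bar{\alpha}_3 - \tfrac{1}{4} \bar{\alpha}_1 h^2, \\
\alpha_4 &= \bar{\alpha}_4 - \tfrac{1}{2} \bar{\alpha}_2 h^2 + \tfrac{7}{240} h^4, \\
\alpha_5 &= \bar{\alpha}_5 - \tfrac{5}{6} \bar{\alpha}_3 h^2 + \tfrac{7}{48} \bar{\alpha}_1 h^4, \\
\alpha_6 &= \bar{\alpha}_6 - \tfrac{5}{4} \bar{\alpha}_4 h^2 + \tfrac{7}{16} \bar{\alpha}_2 h^4 - \tfrac{31}{1344} h^6.
\end{aligned} \tag{27.9.3}$$

These are the formulae known as Sheppard's corrections (Ref. 212).
The general expression is (cf Wold, Ref. 245)

$$\alpha_\nu = \sum_{i=0}^{\nu} \binom{\nu}{i} \left(2^{1-i} - 1\right) B_i\, h^i\, \bar{\alpha}_{\nu-i},$$

where the $B_i$ are the Bernoulli numbers defined by (12.2.2).

If we place the origin in the mean of the distribution, we have
$\alpha_1 = \bar{\alpha}_1 = 0$, and so obtain the corrections for the central moments:

$$\begin{aligned}
\mu_2 &= \bar{\mu}_2 - \tfrac{1}{12} h^2, \\
\mu_3 &= \bar{\mu}_3, \\
\mu_4 &= \bar{\mu}_4 - \tfrac{1}{2} \bar{\mu}_2 h^2 + \tfrac{7}{240} h^4.
\end{aligned} \tag{27.9.4}$$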
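The coefficients in (27.9.3) can be reproduced from the general expression. The following Python sketch is an illustration under stated assumptions: the function names are hypothetical, the Bernoulli numbers are taken in the convention of the generating function $x/(e^x - 1)$ (so that $B_1 = -\frac12$, $B_2 = \frac16$, $B_4 = -\frac{1}{30}$), and the numerical example groups a standard normal population with $h = 0.5$ and $\xi_0 = 0$.

```python
from fractions import Fraction
from math import comb
from statistics import NormalDist

def bernoulli(m):
    """Bernoulli numbers B_0..B_m from x/(e^x - 1), so that B_1 = -1/2."""
    B = [Fraction(1)]
    for k in range(1, m + 1):
        B.append(-sum(comb(k + 1, j) * B[j] for j in range(k)) / (k + 1))
    return B

def sheppard_coefficients(v):
    """c_i such that alpha_v = sum_i c_i * h**i * grouped_alpha_(v-i)."""
    B = bernoulli(v)
    return [comb(v, i) * (Fraction(2) ** (1 - i) - 1) * B[i] for i in range(v + 1)]

def correct_raw_moments(grouped, h):
    """Apply the corrections to grouped raw moments grouped[0..k] (grouped[0] = 1)."""
    corrected = []
    for v in range(len(grouped)):
        c = sheppard_coefficients(v)
        corrected.append(sum(c[i] * h ** i * grouped[v - i] for i in range(v + 1)))
    return corrected

# Illustration: a standard normal population grouped with h = 0.5, xi_0 = 0.
h, dist = 0.5, NormalDist()
mids = [i * h for i in range(-20, 21)]
probs = [dist.cdf(x + h / 2) - dist.cdf(x - h / 2) for x in mids]
grouped = [sum(p * x ** v for x, p in zip(mids, probs)) for v in range(5)]
corrected = correct_raw_moments(grouped, h)
print(grouped[2], corrected[2])   # the grouped value exceeds 1 by about h^2/12
print(grouped[4], corrected[4])   # the corrected value is close to 3
```

In this example the frequency curve has high order contact with the x axis, which is the situation in which the corrections are expected to work well.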




These relations hold under the assumption that the remainder R
in (27.9.2) may be neglected. Suppose now that we are given two
positive integers $s$ and $k$ such that:

1) $f(x)$ and its first $2s$ derivatives are continuous for all $x$.

2) The product $x^{k+2} f^{(i)}(x)$ is bounded for all $x$ and for $i = 0, 1, \ldots, 2s$.

The function $g(\xi)$ given by (27.9.1) will then be continuous
for all $\xi$ together with its first $2s + 1$ derivatives, and it is easily
seen that for $\nu = 1, 2, \ldots, k$ and $i = 0, 1, \ldots, 2s + 1$ we have

$$g^{(i)}(\xi) = O(\xi^{-2}) \tag{27.9.5}$$

as $\xi \to \pm\infty$. Consequently we may apply the Euler-MacLaurin formula
in the form (12.2.6), and thus find that the remainder R may be
written in the form

$$R = (-1)^{s+1} h^{2s+1} \int_{-\infty}^{\infty} P_{2s+1}(y)\, g^{(2s+1)}(\xi_0 + hy)\,dy.$$

It then follows from (12.2.1) and (27.9.5) that we have

$$|R| < A h^{2s+1} \int_{-\infty}^{\infty} \frac{dy}{1 + (\xi_0 + hy)^2} < B h^{2s},$$

where A and B are constants not depending on h. Thus if h, the
width of the class interval, is sufficiently small, R may be neglected
and the corrections (27.9.3) or (27.9.4) applied to moments of any
order $\nu \le k$, the error involved being of the order $h^{2s}$.

Whenever the frequency curve $y = f(x)$ has a contact of high order
with the x axis at both ends of the range, the above conditions 1) and
2) are satisfied for moderate values of $s$ and $k$. In such cases, it has
been found in practice that the result of applying Sheppard's corrections
to the moments is usually good even when h is not very small.
It is, however, always advisable to compare the amount of the correction
to be applied to a certain moment with the standard deviation
of the sampling distribution of that moment. If, as is often the case,
the correction only amounts to a small fraction of the s. d., it does
not really matter whether the correction is applied or not.

In cases where the frequency curve has not a high order terminal
contact, it is usually better not to apply Sheppard's corrections.
Other correction formulae have been proposed for use in such cases,
but they do not seem to be of sufficiently general validity (cf Elderton,
Ref. 12, p. 231).




Langdon and Ore (Ref. 144) and Wold (Ref. 245, 246) have given
corrections for the semi-invariants which are valid under the same
conditions as Sheppard's. These have the simple form

$$\kappa_1 = \bar{\kappa}_1, \quad \text{and} \quad \kappa_\nu = \bar{\kappa}_\nu - \frac{B_\nu}{\nu}\, h^{\nu} \quad (\nu > 1).$$

The deduction of Sheppard's corrections may be extended to moments
of multi-dimensional samples. In particular we have for a
two-dimensional distribution with class intervals of the length $h_1$ for $x$
and $h_2$ for $y$

$$\begin{aligned}
\mu_{11} &= \bar{\mu}_{11}, \qquad \mu_{21} = \bar{\mu}_{21}, \qquad \mu_{31} = \bar{\mu}_{31} - \tfrac{1}{4} \bar{\mu}_{11} h_1^2, \\
\mu_{22} &= \bar{\mu}_{22} - \tfrac{1}{12} \bar{\mu}_{20} h_2^2 - \tfrac{1}{12} \bar{\mu}_{02} h_1^2 + \tfrac{1}{144} h_1^2 h_2^2.
\end{aligned} \tag{27.9.6}$$

The corrections for $\mu_{12}$ and $\mu_{13}$ are, of course, obtained by permutation
of indices, and the corrections for the marginal moments $\mu_{i0}$ and $\mu_{0j}$
follow directly from (27.9.4), so that by these formulae we are able
to find the corrections for all moments of orders not exceeding four.

It should finally be remarked that the problem of corrections for
grouping has been treated also from various other points of view.
The reader may be referred e. g. to Fisher (Ref. 89) and Kendall
(Ref. 136).


CHAPTER 28. 

Asymptotic Properties of Sampling Distributions. 

28.1. Introductory remarks. In 27.3 and 27.8, we have seen
that all ordinary sample characteristics that are functions of the moments
converge in probability to the corresponding population characteristics,
as the size n of the sample tends to infinity. In the present
chapter, the asymptotic behaviour for large n of the sampling distributions
of these and certain other characteristics will be considered
somewhat more closely. Following up a remark made in 17.5, we shall
first show that, under very general conditions, characteristics based
on the sample moments are asymptotically normally distributed for
large n. We shall then consider certain other classes of sample characteristics,
some of which are, like the moment characteristics, asymptotically
normal, while others show a totally different asymptotic
behaviour.

28.2. The moments. Consider n sample values $x_1, \ldots, x_n$ from
a one-dimensional distribution. The quantity $n a_\nu = \sum_i x_i^{\nu}$ is a sum of
n independent random variables $x_i^{\nu}$, all having the same distribution,
with the mean $E(x_i^{\nu}) = \alpha_\nu$ and the variance $D^2(x_i^{\nu}) = \alpha_{2\nu} - \alpha_\nu^2$. We may
then apply the Lindeberg-Lévy case of the Central Limit Theorem (cf
17.4) and find that, as $n \to \infty$, the d. f. of the standardized sum

$$\frac{\sum_i x_i^{\nu} - n\alpha_\nu}{\sqrt{n(\alpha_{2\nu} - \alpha_\nu^2)}} = \sqrt{n}\; \frac{a_\nu - \alpha_\nu}{\sqrt{\alpha_{2\nu} - \alpha_\nu^2}}$$

tends to the normal d. f. $\Phi(x)$. According to the terminology introduced
in 17.4, any sample moment $a_\nu$ is thus asymptotically normal
$\left(\alpha_\nu, \sqrt{(\alpha_{2\nu} - \alpha_\nu^2)/n}\right)$. We observe that the parameters of the limiting
normal distribution are identical with the mean and the s. d. of $a_\nu$,
as given by (27.3.1). In particular, the mean $a_1 = \bar{x}$ of the sample
is asymptotically normal $(m, \sigma/\sqrt{n})$, as already pointed out in 17.4.

Similarly, when we consider simultaneously the two random variables
$n a_\nu = \sum_i x_i^{\nu}$ and $n a_\varrho = \sum_i x_i^{\varrho}$, an application of the two-dimensional
form of the Lindeberg-Lévy theorem (cf 21.11) shows that the
joint distribution of the two variables $\sqrt{n}\,(a_\nu - \alpha_\nu)$ and $\sqrt{n}\,(a_\varrho - \alpha_\varrho)$
tends to a certain two-dimensional normal distribution. The argument
is evidently general, and by means of the multi-dimensional form of
the Lindeberg-Lévy theorem (cf 24.7) we obtain the following result:

The joint distribution of any number of the quantities $\sqrt{n}\,(a_\nu - \alpha_\nu)$
tends to a normal distribution with zero mean values and the second
order moments

$$\begin{aligned}
\lambda_{\nu\nu} &= \sigma_\nu^2 = E\left(n\,(a_\nu - \alpha_\nu)^2\right) = \alpha_{2\nu} - \alpha_\nu^2, \\
\lambda_{\nu\varrho} &= E\left(n\,(a_\nu - \alpha_\nu)(a_\varrho - \alpha_\varrho)\right) = \alpha_{\nu+\varrho} - \alpha_\nu \alpha_\varrho.
\end{aligned} \tag{28.2.1}$$

Thus if we introduce standardized variables $z_\nu$ defined by

$$a_\nu = \alpha_\nu + \frac{\sigma_\nu z_\nu}{\sqrt{n}}, \tag{28.2.2}$$

every $z_\nu$ will have zero mean and unit s. d., and the joint distribution
of the $z_\nu$ will be asymptotically normal, with the covariances

$$E(z_\nu z_\varrho) = \frac{\lambda_{\nu\varrho}}{\sigma_\nu \sigma_\varrho}.$$

The extension of the above considerations to moments of multi-dimensional
samples is immediate.


28.3. The central moments. By the remarks made in connection
with (27.5.2), any central moment $m_\nu$ may be written in the form

$$m_\nu = a_\nu - \nu \bar{x}\, a_{\nu-1} + \frac{w}{n},$$

where $w$ is a random variable such that $E(w^2)$ is smaller than a
quantity independent of n. According to 27.4, we may without loss
of generality assume $m = 0$, so that $\alpha_\nu = \mu_\nu$ and

$$m_\nu - \mu_\nu = a_\nu - \alpha_\nu - \nu \bar{x}\, a_{\nu-1} + \frac{w}{n}.$$

Introducing the standardized variables $z_\nu$ defined by (28.2.2), we then
have

$$\sqrt{n}\,(m_\nu - \mu_\nu) = \sigma_\nu z_\nu - \nu \mu_{\nu-1} \sigma_1 z_1 + \frac{R}{\sqrt{n}}, \tag{28.3.1}$$

where $R = w - \nu \sigma_1 \sigma_{\nu-1} z_1 z_{\nu-1}$. Now by (9.5.1)

$$E(|R|) \le E(|w|) + \nu \sigma_1 \sigma_{\nu-1} E(|z_1 z_{\nu-1}|) \le \sqrt{E(w^2)} + \nu \sigma_1 \sigma_{\nu-1} \sqrt{E(z_1^2)\, E(z_{\nu-1}^2)},$$

so that $E(|R|)$ is smaller than a quantity independent of n, and it
then follows by an application of Tchebycheff's theorem (15.7.1) that
$R/\sqrt{n}$ converges in probability to zero. Applying the theorem 20.6
to the expression (28.3.1) we thus find that the variable $\sqrt{n}\,(m_\nu - \mu_\nu)$
has, in the limit as $n \to \infty$, the same distribution as the linear expression
$\sigma_\nu z_\nu - \nu \mu_{\nu-1} \sigma_1 z_1$. The joint distribution of $z_\nu$ and $z_1$ is,
however, asymptotically normal, and any linear combination of normally
distributed variables is, by 24.4, itself normally distributed.

Thus any central moment $m_\nu$ of the sample is asymptotically normally
distributed, with the mean $\mu_\nu$ and the variance

$$\frac{\sigma_\nu^2 - 2\nu \mu_{\nu-1} \lambda_{1\nu} + \nu^2 \mu_{\nu-1}^2 \sigma_1^2}{n} = \frac{\mu_{2\nu} - 2\nu \mu_{\nu-1} \mu_{\nu+1} - \mu_\nu^2 + \nu^2 \mu_2 \mu_{\nu-1}^2}{n}.$$

We observe that the variance of the limiting normal distribution is
identical with the leading term of $D^2(m_\nu)$ as given by (27.5.4). If
we consider simultaneously any number of the $m_\nu$, we find in the
same way, using the last theorem of 22.6, that the joint distribution
of the $m_\nu$ is asymptotically normal, with the means $\mu_\nu$, and variances
and covariances given by the leading terms of (27.5.4) and (27.5.6).
As in the preceding paragraph, the extension to moments of multi-dimensional
samples is immediate.
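The limiting variance of a central moment in 28.3 is easy to evaluate once the population central moments are known. The sketch below is illustrative only: the function names are hypothetical, and the central moments of a normal population ($\mu_k = 0$ for odd k, $\mu_k = (k-1)(k-3)\cdots 1 \cdot \sigma^k$ for even k) are assumed as standard results.

```python
def asymptotic_variance_central_moment(v, mu, n):
    """Leading term of D^2(m_v); mu[k] is the population central moment of order k."""
    return (mu[2 * v] - 2 * v * mu[v - 1] * mu[v + 1]
            - mu[v] ** 2 + v ** 2 * mu[2] * mu[v - 1] ** 2) / n

def normal_central_moments(sigma, max_order):
    """Central moments mu_0..mu_max_order of a normal population (assumed known)."""
    mu = [1.0]
    for k in range(1, max_order + 1):
        if k % 2:
            mu.append(0.0)                                  # odd moments vanish
        else:
            mu.append(mu[k - 2] * (k - 1) * sigma ** 2)     # mu_k = (k-1) sigma^2 mu_(k-2)
    return mu

mu = normal_central_moments(sigma=1.0, max_order=6)
n = 100
print(asymptotic_variance_central_moment(2, mu, n))   # 2 sigma^4 / n
print(asymptotic_variance_central_moment(3, mu, n))   # 6 sigma^6 / n
```

For $\nu = 2$ this gives the familiar leading term $(\mu_4 - \mu_2^2)/n$ for the variance of the sample variance.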

28.4. Functions of moments. As in 27.7, we shall confine our
attention to the case of a function $H(m_\nu, m_\varrho)$ of two central moments
from a one-dimensional sample. However, the extension to any number
of arguments, to multi-dimensional samples and to the joint distribution
of any number of functions is immediate. We shall prove the
following theorem.

If, in some neighbourhood of the point $m_\nu = \mu_\nu$, $m_\varrho = \mu_\varrho$, the function
$H(m_\nu, m_\varrho)$ is continuous and has continuous derivatives of the first and
second order with respect to the arguments $m_\nu$ and $m_\varrho$, the random variable
$H(m_\nu, m_\varrho)$ is asymptotically normal, the mean and the variance of the
limiting normal distribution being given by the leading terms of (27.7.3).

It will be observed that in this theorem there is nothing corresponding to condition
2) of the theorem of 27.7. Thus we may e. g. assert that the function … is
asymptotically normal though for certain populations (cf 27.7) neither
the mean nor the variance of … is finite. We remind in this connection of a remark
made in 17.4 to the effect that a variable may be asymptotically normal even though
its mean and variance do not exist, or do not tend to the mean and variance of the
limiting normal distribution.

As in 27.7, we consider the set Z of all points $(x_1, \ldots, x_n)$ such
that $|m_\nu - \mu_\nu| < \varepsilon$ and $|m_\varrho - \mu_\varrho| < \varepsilon$. In the present case we shall,
however, allow $\varepsilon$ to depend on n, and shall in fact choose $\varepsilon = n^{-3/8}$.
We then have, using the notations of 27.7 and choosing $k = 1$,

$$P(Z) > 1 - \frac{2A}{\varepsilon^2 n} = 1 - 2A n^{-1/4}.$$

If n is sufficiently large, we have for any point of Z the development
(27.7.6), which may be written

$$\sqrt{n}\,(H - H_0) = H_1 \sqrt{n}\,(m_\nu - \mu_\nu) + H_2 \sqrt{n}\,(m_\varrho - \mu_\varrho) + R\sqrt{n},$$

where $|R\sqrt{n}| < K \varepsilon^2 \sqrt{n} = K n^{-1/4}$. Thus the inequality $|R\sqrt{n}| < K n^{-1/4}$
is satisfied with a probability $\ge P(Z) > 1 - 2A n^{-1/4}$, so that $R\sqrt{n}$
converges in probability to zero. By theorem 20.6, we then find that
the variables $\sqrt{n}\,(H - H_0)$ and $H_1 \sqrt{n}\,(m_\nu - \mu_\nu) + H_2 \sqrt{n}\,(m_\varrho - \mu_\varrho)$ have,
in the limit as $n \to \infty$, the same distribution. By the preceding paragraph,
the latter variable is, however, asymptotically normal with the
mean and the variance required by our theorem, which is thus proved.

It follows from this theorem that any sample characteristic based on
moments is, for large values of n, approximately normally distributed
about the corresponding population characteristic, with a variance of the
form $c/n$, provided only that the leading terms of (27.7.3) yield finite
values for the mean and the variance of the limiting distribution.

This is true for samples in any number of dimensions. Thus e. g.
the coefficients of skewness and excess (15.8), the coefficients of regression
(21.6 and 23.2), the generalized variance (22.7), and the
coefficients of total, partial and multiple correlation (21.7, 23.4 and
23.5) are all asymptotically normally distributed about the corresponding
coefficients of the population.

One important remark should, however, be made in this connection.
In general, the constant c in the expression of the variance
will have a positive value. However, in exceptional cases c may be
zero, which implies that the variance is of a smaller order than $n^{-1}$.
Looking back on the proof of the theorem, it is readily seen that in
such a case the proof shows that the variable $\sqrt{n}\,(H - H_0)$ converges
in probability to zero, which may be expressed by saying that H is
asymptotically normal with zero variance, as far as terms of order $n^{-1}$
are concerned. It may, however, then occur that some expression of
the form $n^p (H - H_0)$ with $p > \frac12$ may have a definite limiting distribution,
but this is not necessarily normal. We shall encounter an example
of this phenomenon in 29.12, in connection with the distribution of
the multiple correlation coefficient in the particular case when the
corresponding population value is zero.

28.5. The quantiles. Consider a sample of n values from a one-dimensional
distribution of the continuous type, with the d. f. $F(x)$
and the fr. f. $f(x) = F'(x)$. Let $\zeta = \zeta_p$ denote the quantile (cf 15.6) of
order p of the distribution, i. e. the root (assumed unique) of the
equation $F(\zeta) = p$, where $0 < p < 1$. We shall suppose that, in some
neighbourhood of $x = \zeta_p$, the fr. f. $f(x)$ is continuous and has a continuous
derivative $f'(x)$.

We further denote by $z_p$ the corresponding quantile of the sample.
If np is not an integer, and if we arrange the sample values in
ascending order of magnitude: $x_1 \le x_2 \le \cdots \le x_n$, there is a unique
quantile $z_p$ equal to the sample value $x_{\mu+1}$, where $\mu = [np]$ denotes
the greatest integer $\le np$. If np is an integer, we are in the indeterminate
case (cf 15.5 and 15.6), and $z_p$ may be any value in the
interval $(x_{np}, x_{np+1})$. In order to avoid trivial complications, we assume
in the sequel that np is not an integer.

Let $g(x)$ denote the fr. f. of the random variable $z = z_p$. The
probability $g(x)\,dx$ that $z$ is situated in an infinitesimal interval
$(x, x + dx)$ is identical with the probability that, among the n sample
values, $\mu = [np]$ are $< x$, and $n - \mu - 1$ are $> x + dx$, while the
remaining value falls between $x$ and $x + dx$. Hence

$$g(x)\,dx = \binom{n}{\mu} (n - \mu)\, (F(x))^{\mu}\, (1 - F(x))^{n-\mu-1} f(x)\,dx.$$

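In order to study the behaviour of the distribution of $z$ for large n,
we consider the random variable $y = \sqrt{n/(pq)}\; f(\zeta)\,(z - \zeta)$, where $q = 1 - p$.
By (15.1.2) $y$ has the fr. f.

$$\sqrt{\frac{pq}{n}}\; \frac{1}{f(\zeta)}\; g(t) = A_1 A_2 A_3, \qquad t = \zeta + \sqrt{\frac{pq}{n}}\; \frac{x}{f(\zeta)},$$

with

$$A_1 = \sqrt{npq}\, \binom{n}{\mu} p^{\mu} q^{n-\mu}, \qquad
A_2 = \left(\frac{F(t)}{p}\right)^{\mu} \left(\frac{1 - F(t)}{q}\right)^{n-\mu-1}, \qquad
A_3 = \frac{n - \mu}{nq} \cdot \frac{f(t)}{f(\zeta)},$$

where we have for any fixed $x$ as $n \to \infty$ (cf 16.4.8)

$$A_1 \to \frac{1}{\sqrt{2\pi}}, \qquad A_3 \to 1.$$

Further

$$F(t) = F(\zeta) + \sqrt{\frac{pq}{n}}\; x + \frac{pq}{2n}\; \frac{f'(\zeta + \theta (t - \zeta))}{f^2(\zeta)}\; x^2,$$

where $0 < \theta < 1$. Now $F(\zeta) = p$, and thus

$$F(t) = p + \sqrt{\frac{pq}{n}}\; x + O\left(\frac{1}{n}\right).$$

Substituting this in the expression of $A_2$, we find after some calculation

$$A_2 \to e^{-x^2/2},$$

so that the fr. f. of y tends to the normal fr. f. $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$. It is also
seen that $A_1$, $A_2$ and $A_3$ are uniformly bounded in any interval
$a < x < b$, so that by (5.3.6) the probability of the inequality $a < y < b$
tends to the limit $\frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx$.

It follows that the sample quantile $z_p$ is asymptotically normal
$\left(\zeta, \frac{1}{f(\zeta)} \sqrt{\frac{pq}{n}}\right)$, where $\zeta = \zeta_p$ is the corresponding quantile of the population.
In particular the median of the sample is asymptotically normal
$\left(\zeta, \frac{1}{2 f(\zeta) \sqrt{n}}\right)$, where $\zeta = \zeta_{1/2}$ is the median of the population.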

For a normal distribution, with the parameters m and $\sigma$, the median
is m, and we have $f(m) = \frac{1}{\sigma\sqrt{2\pi}}$. Thus the median $z$ of a sample of
n from this distribution is asymptotically normal $\left(m, \sigma\sqrt{\pi/(2n)}\right)$.

On the other hand, we know that the mean $\bar{x}$ of such a sample is exactly normal
$(m, \sigma/\sqrt{n})$. As $n \to \infty$, $z$ and $\bar{x}$ both converge in probability to m, and for large
values of n we may use either $z$ or $\bar{x}$ as an estimate of m. The latter estimate should,
however, be considered as having the greater precision, since the s. d. $\sigma/\sqrt{n}$ corresponding
to $\bar{x}$ is smaller than the s. d. $\sigma\sqrt{\pi/(2n)} = 1.2533\,\sigma/\sqrt{n}$ corresponding to $z$. A systematic
comparison of the precision of various estimates of a population characteristic will be
given in the theory of estimation (cf. Ch. 32).


Consider now the joint distribution of two quantiles $z'$ and $z''$ of
orders $p_1$ and $p_2$, where $p_1 < p_2$. By a calculation of the same kind
as above, it can be shown that this distribution is asymptotically
normal. The means of the limiting normal distribution are the corresponding
quantiles $\zeta'$ and $\zeta''$ of the population, while the asymptotic
expressions of the second order moments $\mu_2(z')$, $\mu_{11}(z', z'')$, $\mu_2(z'')$ are

$$\frac{p_1 q_1}{n f^2(\zeta')}, \qquad \frac{p_1 q_2}{n f(\zeta') f(\zeta'')}, \qquad \frac{p_2 q_2}{n f^2(\zeta'')}.$$

Choosing in particular $p_1 = \frac14$, $p_2 = \frac34$, $\zeta'$ and $\zeta''$ are the lower and
upper quartiles of the population, and we find that the semi-interquartile
range (cf 15.6) of the sample, $\frac12 (z'' - z')$, is asymptotically
normal, with the mean $\frac12 (\zeta'' - \zeta')$ and the s. d.

$$\frac{1}{8\sqrt{n}} \sqrt{\frac{3}{f^2(\zeta')} - \frac{2}{f(\zeta') f(\zeta'')} + \frac{3}{f^2(\zeta'')}}.$$

For a normal (m, $\sigma$) population, the mean of the semi-interquartile
range becomes $0.6745\,\sigma$, and the s. d. $0.7867\,\sigma/\sqrt{n}$.
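The numerical constants quoted for the normal population can be recomputed from the asymptotic formulae of this paragraph. The sketch below is an illustration with hypothetical function names; it evaluates everything for the standardized case $m = 0$, $\sigma = 1$, $n = 1$, so the printed values are the factors multiplying $\sigma$ and $\sigma/\sqrt{n}$.

```python
from math import sqrt
from statistics import NormalDist

def quantile_sd(p, density_at_quantile, n):
    """Asymptotic s.d. of the sample quantile of order p: sqrt(p q / n) / f(zeta)."""
    return sqrt(p * (1 - p) / n) / density_at_quantile

def semi_interquartile_sd(f_lower, f_upper, n):
    """Asymptotic s.d. of (z'' - z') / 2 for p1 = 1/4 and p2 = 3/4."""
    inner = 3 / f_lower ** 2 - 2 / (f_lower * f_upper) + 3 / f_upper ** 2
    return sqrt(inner) / (8 * sqrt(n))

std = NormalDist()                      # m = 0, sigma = 1
n = 1
median_factor = quantile_sd(0.5, std.pdf(0.0), n)
q1, q3 = std.inv_cdf(0.25), std.inv_cdf(0.75)
siqr_mean = (q3 - q1) / 2
siqr_factor = semi_interquartile_sd(std.pdf(q1), std.pdf(q3), n)

print(round(median_factor, 4))   # factor of sigma / sqrt(n) for the median
print(round(siqr_mean, 4))       # factor of sigma for the mean of the semi-interquartile range
print(round(siqr_factor, 4))     # factor of sigma / sqrt(n) for its s.d.
```

The first value is $\sqrt{\pi/2}$, which exceeds 1, so the median has the larger asymptotic s. d. when compared with the mean of the sample.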

28.6. The extreme values and the range. So far, we have only
considered sample characteristics which, in large samples, tend to be
normally distributed. We now turn to a group of characteristics
showing a totally different behaviour.

In a one-dimensional sample of n values, there are always two finite
and uniquely determined extreme values 1), and also a finite range, which
is the difference between the extremes. More generally, we may arrange
the n sample values in order of magnitude, and consider the $\nu$:th
value from the top or from the bottom. For $\nu = 1$ we obtain, of course,
the extreme values.

1) If, e. g., the two uppermost values are equal, any of them will be considered
as the upper extreme value, and similarly in other cases.

It is often important to know the sampling distributions of the
extreme values, the $\nu$:th values, the range, and other similar characteristics
of the sample. We shall now consider some properties of
these distributions.

We restrict ourselves to the case when the population has a distribution
of the continuous type, with the d. f. $F$ and the fr. f. $f = F'$.
Let $x$ denote the $\nu$:th value from the top in a sample of n from this
population. The probability element $g_\nu(x)\,dx$ in the sampling distribution
of $x$ is identical with the probability that, among the n sample
values, $n - \nu$ are $< x$, and $\nu - 1$ are $> x + dx$, while the remaining
value falls between $x$ and $x + dx$. Hence

$$g_\nu(x)\,dx = n \binom{n-1}{\nu-1} (F(x))^{n-\nu}\, (1 - F(x))^{\nu-1} f(x)\,dx. \tag{28.6.1}$$

If we introduce a new variable $\xi$ by the substitution

$$\xi = n\,(1 - F(x)), \tag{28.6.2}$$

we shall have $0 \le \xi \le n$, and the fr. f. $h_\nu(\xi)$ of the new variable will be

$$h_\nu(\xi) = \binom{n-1}{\nu-1} \left(\frac{\xi}{n}\right)^{\nu-1} \left(1 - \frac{\xi}{n}\right)^{n-\nu} \tag{28.6.3}$$

for $0 \le \xi \le n$, and $h_\nu(\xi) = 0$ outside (0, n). As $n \to \infty$, $h_\nu(\xi)$ converges
for any $\xi \ge 0$ to the limit

$$\lim_{n \to \infty} h_\nu(\xi) = \frac{\xi^{\nu-1}}{\Gamma(\nu)}\, e^{-\xi}. \tag{28.6.4}$$

Further, $h_\nu(\xi)$ is uniformly bounded for all n in every finite $\xi$-interval,
and thus by (5.3.6) $\xi$ is, in the limit as $n \to \infty$, distributed according
to the fr. f. (28.6.4), which is a particular case of (12.3.3).
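The convergence of the fr. f. (28.6.3) to the limit (28.6.4) can be observed numerically. The following sketch is only an illustration; the function names and the chosen point $\xi = 1.5$, $\nu = 2$ are arbitrary assumptions.

```python
from math import comb, exp, gamma

def h_exact(xi, v, n):
    """Fr. f. (28.6.3) of xi = n(1 - F(x)) for the v:th value from the top."""
    if not 0 <= xi <= n:
        return 0.0
    return comb(n - 1, v - 1) * (xi / n) ** (v - 1) * (1 - xi / n) ** (n - v)

def h_limit(xi, v):
    """Limiting fr. f. (28.6.4)."""
    return xi ** (v - 1) * exp(-xi) / gamma(v)

for n in (10, 100, 10000):
    print(n, h_exact(1.5, 2, n), h_limit(1.5, 2))
```

The difference between the two columns shrinks roughly in proportion to $1/n$ in this example.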

Similarly, if $y$ denotes the $\nu$:th value from the bottom in our
sample, and if we introduce a new variable $\eta$ by the substitution

$$\eta = n F(y), \tag{28.6.5}$$

we find that $\eta$ has the fr. f. $h_\nu(\eta)$ and thus, in the limit, the fr. f.
$\frac{\eta^{\nu-1}}{\Gamma(\nu)}\, e^{-\eta}$.

We may also consider the joint distribution of the $\nu$:th value $x$
from the top and the $\nu$:th value $y$ from the bottom. Introducing the
variables $\xi$ and $\eta$ by the substitutions (28.6.2) and (28.6.5), it is then
proved in the same way as above that the joint fr. f. of $\xi$ and $\eta$ is

$$\frac{(n-1)!}{n\,((\nu-1)!)^2\,(n - 2\nu)!} \left(\frac{\xi}{n}\right)^{\nu-1} \left(\frac{\eta}{n}\right)^{\nu-1} \left(1 - \frac{\xi + \eta}{n}\right)^{n-2\nu}, \tag{28.6.6}$$

where $\xi > 0$, $\eta > 0$, $\xi + \eta < n$ and $2\nu < n$. As $n \to \infty$, this tends to

$$\frac{\xi^{\nu-1}}{\Gamma(\nu)}\, e^{-\xi} \cdot \frac{\eta^{\nu-1}}{\Gamma(\nu)}\, e^{-\eta}, \tag{28.6.7}$$

so that $\xi$ and $\eta$ are, in the limit, independent.

When the d. f. $F$ is given, it is sometimes possible to solve the
equations (28.6.2) and (28.6.5) explicitly with respect to $x$ and $y$. We
then obtain the $\nu$:th values $x$ and $y$ expressed in terms of the auxiliary
variables $\xi$ and $\eta$ of known distributions. When an explicit solution
cannot be given, it is often possible to obtain an asymptotic solution
for large values of n. In such cases, the known distributions of $\xi$
and $\eta$ may be used to find the limiting forms of the distributions of
the $\nu$:th values, the range etc. We now proceed to consider some
examples of this method, omitting certain details of calculation.


1. The rectangular distribution. Let the sampled variable be
uniformly distributed (cf 19.1) over the interval (a, b). If, in a sample
of n from this distribution, $x$ and $y$ are the $\nu$:th values from the top
and from the bottom, (28.6.2) and (28.6.5) give

$$x = b - \frac{b - a}{n}\,\xi, \qquad y = a + \frac{b - a}{n}\,\eta,$$

where $\xi$ and $\eta$ have the joint fr. f. (28.6.6), with the limiting form
(28.6.7). Hence we obtain

$$E(x) = b - \frac{\nu}{n+1}\,(b - a), \qquad D^2(x) = \frac{\nu\,(n - \nu + 1)}{(n+1)^2 (n+2)}\,(b - a)^2,$$

and similar expressions for $y$. We further have

$$E\left(\frac{x + y}{2}\right) = \frac{a + b}{2}, \qquad D^2\left(\frac{x + y}{2}\right) = \frac{\nu}{2\,(n+1)(n+2)}\,(b - a)^2,$$

which shows that the arithmetic mean of the $\nu$:th values $x$ and $y$
provides a consistent and unbiased estimate (cf 27.6) of the mean
$(a + b)/2$ of the distribution. Finally, we have for the difference $x - y$

$$E(x - y) = \left(1 - \frac{2\nu}{n+1}\right)(b - a), \qquad D^2(x - y) = \frac{2\nu\,(n - 2\nu + 1)}{(n+1)^2 (n+2)}\,(b - a)^2. \tag{28.6.9}$$

For $\nu = 1$ the difference $x - y$ is, of course, the range of the sample.

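2. The triangular distribution. In the case of a triangular distribution
(cf 19.1) over the range (a, b), the equations (28.6.2) and (28.6.5)
give, when $x > \frac{a + b}{2}$ and $y < \frac{a + b}{2}$,

$$x = b - (b - a)\sqrt{\frac{\xi}{2n}}, \qquad y = a + (b - a)\sqrt{\frac{\eta}{2n}}.$$

We consider only the particular case $\nu = 1$, when $x$ and $y$ are the
extreme values of the sample, and then obtain

$$\begin{aligned}
E\left(\frac{x + y}{2}\right) &= \frac{a + b}{2}, &
D^2\left(\frac{x + y}{2}\right) &= \frac{4 - \pi}{16\,n}\,(b - a)^2 + O\left(\frac{1}{n^2}\right), \\
E(x - y) &= \left(1 - \sqrt{\frac{\pi}{2n}}\right)(b - a) + O\left(n^{-3/2}\right), &
D^2(x - y) &= \frac{4 - \pi}{4\,n}\,(b - a)^2 + O\left(\frac{1}{n^2}\right).
\end{aligned} \tag{28.6.10}$$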


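3. Cauchy's distribution. For the distribution given by the fr. f.
(19.2.1), the substitution (28.6.2) gives

$$\xi = \frac{n}{\pi} \int_x^{\infty} \frac{\lambda\,dt}{\lambda^2 + (t - \mu)^2} = \frac{n}{\pi}\, \operatorname{arc\,cot} \frac{x - \mu}{\lambda},$$

or

$$x = \mu + \lambda \cot \frac{\pi \xi}{n} = \mu + \frac{\lambda n}{\pi \xi} + O\left(\frac{1}{n}\right),$$

where $\xi$ has the limiting distribution (28.6.4). The remainder converges
in probability to zero, and it then follows from 20.6 that the
$\nu$:th value $x$ from the top is, in the limit, distributed as $\mu + \frac{\lambda n}{\pi}\, v$,
where $v = \frac{1}{\xi}$ has the fr. f. $\frac{1}{\Gamma(\nu)}\, v^{-\nu-1} e^{-1/v}$ $(v > 0)$. Similarly the $\nu$:th value from
the bottom, $y$, is distributed as $\mu - \frac{\lambda n}{\pi}\, w$, where $w$ is, in the limit,
independent of $v$ and has a distribution of the same form. In the
case $\nu = 1$, the mean values of $x$ and $y$ are not finite. For $\nu > 2$ we
have

$$E\left(\frac{x + y}{2}\right) = \mu, \qquad D^2\left(\frac{x + y}{2}\right) = \frac{\lambda^2}{\pi^2} \left(\frac{n^2}{2\,(\nu - 1)^2 (\nu - 2)} + O(n)\right). \tag{28.6.11}$$

We observe that the variance does not tend to zero as $n \to \infty$. Accordingly
$\frac{x + y}{2}$ does not converge in probability to $\mu$, so that $\frac{x + y}{2}$
is not a consistent estimate (cf 27.6) of $\mu$.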

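4. Laplace's distribution. For the fr. f. (19.2.4) we obtain for
the $\nu$:th value $x$ from the top, when $x > \mu$,

$$x = \mu + \lambda \log \frac{n}{2} - \lambda \log \xi,$$

where $\xi$ has the limiting distribution (28.6.4). Substituting $v$ for
$-\log \xi$, we thus have

$$x = \mu + \lambda \log \frac{n}{2} + \lambda v,$$

where $v = -\log \xi$ has, in the limit, the fr. f.

$$j_\nu(v) = \frac{1}{\Gamma(\nu)}\, e^{-\nu v - e^{-v}}.$$

Similarly, the $\nu$:th value from the bottom is

$$y = \mu - \lambda \log \frac{n}{2} - \lambda w,$$

where $w$ is, in the limit, independent of $v$ and has the fr. f. $j_\nu(w)$.
In the particular case $\nu = 1$ we have (cf the following example)

$$E\left(\frac{x + y}{2}\right) = \mu, \qquad D^2\left(\frac{x + y}{2}\right) = \frac{\pi^2}{12}\,\lambda^2 + O\left(\frac{1}{n}\right), \tag{28.6.12}$$

and we observe that, as in the preceding case, $\frac{x + y}{2}$ is not a consistent
estimate of $\mu$.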

5. The normal distribution. Consider first a normal distribution
with the standardized parameters $m = 0$ and $\sigma = 1$. If $x$ is the $\nu$:th
value from the top in a sample of n from this distribution, (28.6.2) gives

$$\xi = \frac{n}{\sqrt{2\pi}} \int_x^{\infty} e^{-t^2/2}\,dt.$$

It is required to find an asymptotic solution of this equation with
respect to $x$, when n is large. By partial integration, the equation
may be put in the form

$$\frac{\xi}{n} = \frac{1}{x\sqrt{2\pi}}\; e^{-x^2/2} \left(1 + O\left(\frac{1}{x^2}\right)\right).$$

Assuming $\xi$ bounded, we obtain after some calculation

$$x = \sqrt{2 \log n} - \frac{\log\log n + \log 4\pi}{2\sqrt{2 \log n}} - \frac{\log \xi}{\sqrt{2 \log n}} + O\left(\frac{1}{\log n}\right),$$

and it follows that the remainder converges in probability to zero.
Proceeding to the general case of a normal distribution with arbitrary
parameters m and $\sigma$, we need only replace $x$ by $\frac{x - m}{\sigma}$. Substituting
at the same time $v$ for $-\log \xi$, we thus find that the $\nu$:th
value $x$ from the top has the expression

$$x = m + \sigma\sqrt{2 \log n} - \sigma\, \frac{\log\log n + \log 4\pi}{2\sqrt{2 \log n}} + \frac{\sigma}{\sqrt{2 \log n}}\, v + O\left(\frac{1}{\log n}\right), \tag{28.6.13}$$

where $v = -\log \xi$ is a variable which, in the limit as $n \to \infty$, has
the fr. f.

$$j_\nu(v) = \frac{1}{\Gamma(\nu)}\, e^{-\nu v - e^{-v}} \tag{28.6.14}$$

already encountered in the preceding example. Similarly we have, for
the $\nu$:th value $y$ from the bottom, the expression

$$y = m - \sigma\sqrt{2 \log n} + \sigma\, \frac{\log\log n + \log 4\pi}{2\sqrt{2 \log n}} - \frac{\sigma}{\sqrt{2 \log n}}\, w + O\left(\frac{1}{\log n}\right), \tag{28.6.15}$$

where $w$ is, in the limit, independent of $v$ and has the fr. f. $j_\nu(w)$.

Thus for large values of n the $\nu$:th values $x$ and $y$ are related by
simple linear transformations to variables having the limiting distribution
defined by the fr. f. (28.6.14). The frequency curves $u = j_\nu(v)$
are shown for some values of $\nu$ in Fig. 27.

Fig. 27. The frequency curve $u = j_\nu(v)$ for $\nu = 1, 2, 3, 4$.


We observe that the limiting distribution has, except for different normalization,
the same form as in the preceding example. A straightforward generalization of the
above argument shows that the same limiting fr. f. $j_\nu(v)$ appears in all cases where
the fr. f. of the parent distribution is, for large values of $|x|$, asymptotically expressed by

$$f(x) \sim A e^{-B |x|^p},$$

where A, B and p are positive constants.

The mode of a variable which has the fr. f. $j_\nu(v)$ is $-\log \nu$, while
the mean and the variance are given by the relations

$$E(v) = \int_{-\infty}^{\infty} v\, j_\nu(v)\,dv = -\frac{1}{\Gamma(\nu)} \int_0^{\infty} x^{\nu-1} \log x \; e^{-x}\,dx = C - S_1,$$

$$D^2(v) = \int_{-\infty}^{\infty} v^2 j_\nu(v)\,dv - (C - S_1)^2 = \frac{1}{\Gamma(\nu)} \int_0^{\infty} x^{\nu-1} \log^2 x \; e^{-x}\,dx - (C - S_1)^2 = \frac{\pi^2}{6} - S_2,$$

obtained by means of (12.5.6) and (12.5.7). Here C denotes Euler's
constant defined by (12.2.7), while

$$S_1 = \frac{1}{1} + \frac{1}{2} + \cdots + \frac{1}{\nu - 1}, \qquad S_2 = \frac{1}{1^2} + \frac{1}{2^2} + \cdots + \frac{1}{(\nu - 1)^2}.$$
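The closed expressions $C - S_1$ and $\frac{\pi^2}{6} - S_2$ can be compared with a direct numerical integration of the fr. f. $j_\nu(v)$. The sketch below is a rough check under stated assumptions: the function names, the integration range and the step length are arbitrary choices, and the numerical value of Euler's constant is inserted by hand.

```python
from math import exp, gamma, pi

EULER_C = 0.5772156649015329   # Euler's constant C (numerical value assumed known)

def j(v, nu):
    """Limiting fr. f. (28.6.14)."""
    return exp(-nu * v - exp(-v)) / gamma(nu)

def moments_by_quadrature(nu, lo=-10.0, hi=60.0, step=0.001):
    """Mean and variance of j_nu by the midpoint rule on (lo, hi)."""
    steps = int(round((hi - lo) / step))
    m1 = m2 = 0.0
    for i in range(steps):
        v = lo + (i + 0.5) * step
        w = j(v, nu) * step
        m1 += v * w
        m2 += v * v * w
    return m1, m2 - m1 * m1

def moments_closed_form(nu):
    """Mean C - S_1 and variance pi^2/6 - S_2."""
    s1 = sum(1 / k for k in range(1, nu))
    s2 = sum(1 / k ** 2 for k in range(1, nu))
    return EULER_C - s1, pi ** 2 / 6 - s2

for nu in (1, 2, 3):
    print(nu, moments_by_quadrature(nu), moments_closed_form(nu))
```

For $\nu = 1$ the sums $S_1$ and $S_2$ are empty, and the mean and variance reduce to $C$ and $\frac{\pi^2}{6}$.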

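Hence we obtain for the $\nu$:th value $x$ from the top:

$$\begin{aligned}
E(x) &= m + \sigma\sqrt{2 \log n} - \sigma\, \frac{\log\log n + \log 4\pi - 2\,(C - S_1)}{2\sqrt{2 \log n}} + O\left(\frac{1}{\log n}\right), \\
D^2(x) &= \frac{\sigma^2}{2 \log n} \left(\frac{\pi^2}{6} - S_2\right) + O\left(\frac{1}{\log^2 n}\right),
\end{aligned} \tag{28.6.16}$$

and similar expressions for the $\nu$:th value $y$ from the bottom. We
further obtain

$$E\left(\frac{x + y}{2}\right) = m, \qquad D^2\left(\frac{x + y}{2}\right) = \frac{\sigma^2}{4 \log n} \left(\frac{\pi^2}{6} - S_2\right) + O\left(\frac{1}{\log^2 n}\right), \tag{28.6.17}$$

so that in this case $\frac{x + y}{2}$ gives a consistent estimate for m, though
the variance only tends to zero as $(\log n)^{-1}$, which is not nearly so
rapidly as $n^{-1}$. For the difference $x - y$ between the $\nu$:th values
we have

$$\begin{aligned}
E(x - y) &= 2\sigma\sqrt{2 \log n} - \sigma\, \frac{\log\log n + \log 4\pi - 2\,(C - S_1)}{\sqrt{2 \log n}} + O\left(\frac{1}{\log n}\right), \\
D^2(x - y) &= \frac{\sigma^2}{\log n} \left(\frac{\pi^2}{6} - S_2\right) + O\left(\frac{1}{\log^2 n}\right).
\end{aligned} \tag{28.6.18}$$

We may thus obtain a consistent estimate for $\sigma$ by multiplying $x - y$
with an appropriate constant, and the variance of this estimate will,
for a given large value of n, be approximately proportional to $\frac{\pi^2}{6} - S_2$.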

The limiting forms discussed above in connection with the normal
distribution and Laplace's distribution are due partly to R. A. Fisher
and Tippett (Ref. 110), and partly to Gumbel (Ref. 120), in whose
papers further information concerning the properties of these distributions
and their statistical applications will be found.

In the limiting expressions for the case of the normal distribution,
the remainder terms are of the same order as a negative power of
log n. Now log n tends to infinity less rapidly than any power of
n, and accordingly it has been found that the approach to the limiting
forms is here considerably slower than e. g. in the case of the approach
to normality of the distribution of some moment characteristic.
The exact distributions of the extreme values and the range of a sample
from a normal distribution have been investigated by various authors,
and certain tables are available. The reader is referred to K. Pearson's
tables, and to papers by Irwin, Tippett, E. S. Pearson and
Davies, E. S. Pearson and Hartley (Ref. 264, 181, 226, 196, 197).
We give in Fig. 28 some comparisons between the exact distribution
of the largest member of a sample and the corresponding distributions
calculated from the limiting expressions (28.6.13) and (28.6.14).

Fig. 28. Distribution function for the upper extreme of a sample of n values from a
normal population with $m = 0$ and $\sigma = 1$. Exact curves compared with the approximate formula.
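A comparison of the kind shown in Fig. 28 can be imitated numerically. The sketch below is an illustration under stated assumptions: for $\nu = 1$ the limiting d. f. of $v$ is taken as $e^{-e^{-v}}$ (the integral of $j_1$), the exact d. f. of the largest of n independent values is $(\Phi(x))^n$, and the function names are hypothetical.

```python
from math import exp, log, pi, sqrt
from statistics import NormalDist

def exact_cdf_max(x, n):
    """Exact d. f. of the largest of n values from a normal (0, 1) population."""
    return NormalDist().cdf(x) ** n

def limiting_cdf_max(x, n):
    """Approximate d. f. from (28.6.13) and (28.6.14) with nu = 1, m = 0, sigma = 1."""
    b = sqrt(2 * log(n))
    a = b - (log(log(n)) + log(4 * pi)) / (2 * b)
    v = b * (x - a)                    # solve (28.6.13) for v, neglecting the remainder
    return exp(-exp(-v))

for n in (10, 100, 1000):
    for x in (2.0, 3.0, 4.0):
        print(n, x, round(exact_cdf_max(x, n), 4), round(limiting_cdf_max(x, n), 4))
```

Even for n = 1000 the two values still differ in the second decimal place at some points, which is in line with the slow approach to the limiting form described above.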


CHAPTER 29. 

Exact Sampling Distributions. 

29.1. The problem. In the two preceding chapters, we have shown
how to calculate moments and various other characteristics of sampling
distributions, and we have investigated the asymptotic behaviour of
the distributions for samples of infinitely increasing size. However, it
is clear that a knowledge of the exact form of a sampling distribution
would be of a far greater value than the knowledge of a number
of moment characteristics and of a limiting expression for large values
of n. Especially when we are dealing with small samples, as is often
the case in the applications, the asymptotic expressions are sometimes
grossly inadequate, and a knowledge of the exact form of the distribution
would then be highly desirable.

Suppose that we are concerned with a sample of n observed values
from a one-dimensional distribution with the d. f. $F(x)$, and that we
wish to find the sampling distribution of some sample characteristic
$g(x_1, \ldots, x_n)$. The problem is then to find the distribution of a given
function $g(x_1, \ldots, x_n)$ of n independent random variables $x_1, \ldots, x_n$,
each of which has the same distribution with the d. f. $F(x)$.

Theoretically, this problem has been solved in 14.5, where we have
shown that there is always a unique solution, as soon as the functions
$F$ and $g$ are given. Numerically, the problem may often be solved by
means of the computation of tables based on approximate formulae.
If, however, we require a solution that can be explicitly expressed in
terms of known functions, the situation will be quite different. At the
present state of our knowledge such a solution can, in fact, only be
reached in a comparatively small number of cases.

One case where a result of a certain generality can be given, is
the simple case of the mean $\bar{x} = \frac{1}{n} \sum x_i$ of a one-dimensional sample.
In Chs 16 to 19 we have seen (cf 16.2, 16.5, 17.3, 18.1, 19.2) that many
distributions possess what we have called an addition theorem, i. e. a
theorem that gives an explicit expression for the d. f. $G_n(x)$ of the
sum $x_1 + \cdots + x_n$, where the $x_i$ are independent, each having the given
d. f. $F(x)$. The d. f. of the mean $\bar{x}$ is then $G_n(nx)$, and thus we can
find the exact sampling distribution of the mean, whenever the parent
distribution possesses an addition theorem. We shall give some
examples:

When the parent $F(x)$ is normal (m, $\sigma$), we have seen in 17.3 that
the mean $\bar{x}$ is normal $(m, \sigma/\sqrt{n})$.

When $F(x)$ corresponds to a Cauchy distribution, we have seen in
19.2 that $\bar{x}$ has the same d. f. $F(x)$ as the parent population.

When the parent has a Poisson distribution with the parameter $\lambda$,
the mean $\bar{x}$ has the possible values $0, \frac{1}{n}, \frac{2}{n}, \ldots$, and it follows from
(16.5.4) that we have $P\left(\bar{x} = \frac{\nu}{n}\right) = \frac{(n\lambda)^{\nu}}{\nu!}\, e^{-n\lambda}$.
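The Poisson case can be verified by convolving the parent distribution with itself n times and comparing with the closed expression. The sketch below is only an illustration; the function names and the parameter values are arbitrary assumptions.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(x = k) for a Poisson distribution with the parameter lam."""
    return lam ** k * exp(-lam) / factorial(k)

def pmf_of_sum(n, lam, kmax):
    """Probabilities P(x_1 + ... + x_n = k), k = 0..kmax, by repeated convolution."""
    single = [poisson_pmf(k, lam) for k in range(kmax + 1)]
    dist = [1.0] + [0.0] * kmax          # distribution of an empty sum
    for _ in range(n):
        dist = [sum(dist[j] * single[k - j] for j in range(k + 1))
                for k in range(kmax + 1)]
    return dist

def pmf_of_mean(v, n, lam):
    """P(mean = v/n) from the addition theorem: ((n lam)^v / v!) e^(-n lam)."""
    return (n * lam) ** v * exp(-n * lam) / factorial(v)

n, lam, kmax = 5, 0.7, 12
by_convolution = pmf_of_sum(n, lam, kmax)
for v in range(4):
    print(v / n, by_convolution[v], pmf_of_mean(v, n, lam))
```

Truncating at kmax does not affect the probabilities shown, since each convolution term for k only involves indices not exceeding k.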

Apart from the case of the mean (with respect to this case, cf 
Irwin, Ref. 132), very few results of a general character are known 
about the exact form of sampling distributions. Only in one particular 
case, viz. the case of sampling from a normal parent distribution (in 
any number of dimensions), has it so far been possible to investigate 
the subject systematically and reach results of a certain completeness. 
In the present chapter, we shall be concerned with this case. 

Some isolated results belonging to this order of ideas were dis¬ 
covered at an early stage by Helmert, K. Pearson and Student. The 
first systematic investigations of the subject were, however, made by 
R. A. Fisher, who gave rigorous proofs of the earlier results and 
discovered the exact forms of the distributions in fundamentally 
important new cases. In his work on these problems, Fisher generally 
uses methods of analytical geometry in a multi-dimensional space. 
Other methods, involving the use of characteristic functions, or of
certain transformations of variables etc., have later been applied to
problems of this type. In the sequel, we shall give examples of the
use of various methods.

29.2. Fisher's lemma. Degrees of freedom. – In the study of
sampling distributions connected with normally distributed variables,
the following transformation due to R. A. Fisher (Ref. 97) is often
useful. Suppose that $x_1, \ldots, x_n$ are independent random variables,
each of which is normal $(0, \sigma)$. Consider an orthogonal transformation
(cf 11.9)

(29.2.1) $y_i = c_{i1} x_1 + c_{i2} x_2 + \cdots + c_{in} x_n$, $(i = 1, 2, \ldots, n)$,

replacing the variables $x_1, \ldots, x_n$ by new variables $y_1, \ldots, y_n$. By
24.4, the joint distribution of the $y_i$ is normal, and we obtain
(cf Ex. 16, p. 319) $E(y_i) = 0$, and

$$E(y_i y_k) = \sigma^2 \sum_{j=1}^{n} c_{ij} c_{kj} = \begin{cases} \sigma^2 & \text{for } i = k, \\ 0 & \text{for } i \neq k, \end{cases}$$

so that the new variables are uncorrelated. It then follows from
24.1 that they are even independent. Thus the transformed variables
$y_i$ are independent and normal $(0, \sigma)$.

The geometrical significance of this result is evident. The transformation
(29.2.1) corresponds (cf 11.9) to a rotation of the system
of coordinates about the origin, and our result shows that the particular
normal distribution in $R_n$ considered here is invariant under
this rotation.
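
The orthogonality conditions behind (29.2.1) can be checked numerically. The sketch below builds one convenient orthogonal matrix, a Helmert-type matrix whose first row is $(1, \ldots, 1)/\sqrt{n}$; this particular matrix, the function names `helmert` and `rotate`, and the sample values are assumptions chosen for illustration and do not appear in the text.

```python
import math


def helmert(n):
    """Rows of an n x n orthogonal matrix; row 0 is (1, ..., 1)/sqrt(n)."""
    rows = [[1.0 / math.sqrt(n)] * n]
    for k in range(1, n):
        scale = 1.0 / math.sqrt(k * (k + 1))
        # k entries equal to 1, then one entry -k, then zeros, all times `scale`
        rows.append([scale] * k + [-k * scale] + [0.0] * (n - k - 1))
    return rows


def rotate(c, x):
    """Apply y_i = sum_j c_ij x_j."""
    return [sum(cij * xj for cij, xj in zip(row, x)) for row in c]


if __name__ == "__main__":
    x = [2.0, -1.0, 0.5, 3.0]  # assumed sample values
    c = helmert(len(x))
    y = rotate(c, x)
    print("sum of x_i^2:", sum(v * v for v in x))
    print("sum of y_i^2:", sum(v * v for v in y))
    print("y_1 and sqrt(n) * mean:", y[0], math.sqrt(len(x)) * sum(x) / len(x))
```

Since the rows are orthonormal, the sum of squares is unchanged by the transformation, which is the algebraic content of the rotation invariance described above.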

Suppose now that, at first, only a certain number $p < n$ of linear
functions $y_1, \ldots, y_p$ are given, where $y_i = c_{i1} x_1 + \cdots + c_{in} x_n$, and
the $c_{ij}$ satisfy the orthogonality conditions

$$\sum_{j=1}^{n} c_{ij} c_{kj} = \begin{cases} 1 & \text{for } i = k, \\ 0 & \text{for } i \neq k, \end{cases}$$

for $i = 1, 2, \ldots, p$ and $k = 1, 2, \ldots, p$. By 11.9 we can then always
find $n - p$ further rows $c_{i1}, \ldots, c_{in}$, where $i = p + 1, \ldots, n$, such that
the complete matrix $C_{nn} = \{c_{ik}\}$ is orthogonal. Consider the quadratic
form in $x_1, \ldots, x_n$

(29.2.2) $Q(x_1, \ldots, x_n) = \sum_{1}^{n} x_i^2 - y_1^2 - \cdots - y_p^2$.

If we apply here the orthogonal transformation (29.2.1), $\sum_1^n x_i^2$ is by
11.9 transformed into $\sum_1^n y_i^2$, and thus we obtain

$$Q = y_{p+1}^2 + \cdots + y_n^2.$$



Thus $Q$ is equal to the sum of the squares of $n - p$ independent
normal $(0, \sigma)$ variables which are, moreover, independent of $y_1, \ldots, y_p$.

Using (18.1.8), we obtain the following lemma due to R. A. Fisher
(Ref. 97):

The variable $Q$ defined by (29.2.2) is independent of $y_1, \ldots, y_p$ and
has the fr. f.

$$\frac{1}{\sigma^2}\, k_{n-p}\!\left(\frac{x}{\sigma^2}\right),$$

where $k_n(x)$ is the fr. f. (18.1.3) of the $\chi^2$-distribution.

The number $n - p$ is the rank of the form $Q$ (cf 11.6), i. e. the
smallest number of independent variables on which the form may be
brought by a non-singular linear transformation. In statistical applica¬ 
tions, this number of free variables entering into a problem is usually, 
in accordance with the terminology introduced by R. A. Fisher, denoted 
as the number of degrees of freedom (abbreviated d. of fr.) of the 
problem, or of the distribution of the random variables attached to 
the problem. 

Thus e. g. the variable $\chi^2 = \sum_1^n x_i^2$ and its fr. f. $k_n(x)$ considered in
18.1 are said to possess $n$ degrees of freedom, since the quadratic
form $\chi^2$ is of rank $n$. The corresponding distribution will accordingly
be called the $\chi^2$-distribution with $n$ degrees of freedom.

Similarly the form $Q = \sum_1^n x_i^2 - y_1^2 - \cdots - y_p^2$ of rank $n - p$ considered
above will be said to possess $n - p$ degrees of freedom, and
the result proved above thus implies that the variable $Q/\sigma^2$ is distributed
in a $\chi^2$-distribution with $n - p$ degrees of freedom.

The same terminology will often be applied also to other distributions.
In the case of Student's distribution, it is customary to say
that the fr. f. $s_n(x)$ defined by (18.2.4) is attached to Student's distribution
with $n$ degrees of freedom, since the quadratic form in the denominator
of the variable $t$ as defined by (18.2.1) has the rank $n$.
For Fisher's z-distribution (cf. 18.3), we have to distinguish between
the $m$ d. of fr. in the numerator of (18.3.1), and the $n$ d. of fr. in
the denominator.

29.3. The joint distribution of $\bar{x}$ and $s^2$ in samples from a normal
distribution. – We have already pointed out in 29.1 that the mean
$\bar{x}$ of a sample of $n$ from a parent distribution which is normal $(m, \sigma)$
is itself normal $(m, \sigma/\sqrt{n})$. We now proceed to consider the distribution
of the sample variance $s^2 = m_2 = \frac{1}{n}\sum (x_i - \bar{x})^2$, and, at the same
time, the joint distribution of $\bar{x}$ and $s^2$. Without loss of generality,
we may then assume that the population mean $m$ is zero, since this
does not affect $s^2$ and is equivalent to the addition of a constant to $\bar{x}$.

We thus assume that every $x_i$ is normal $(0, \sigma)$, and consider the
identity (cf 11.11.2)

(29.3.1) $n s^2 = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n \bar{x}^2$.

Now $n \bar{x}^2 = \left(\frac{x_1}{\sqrt{n}} + \cdots + \frac{x_n}{\sqrt{n}}\right)^2$ is the square of a linear form
$c_1 x_1 + \cdots + c_n x_n$ such that $c_1^2 + \cdots + c_n^2 = 1$. We may thus apply the
lemma of the preceding paragraph, taking in (29.2.2) $p = 1$ and
$y_1 = \sqrt{n}\, \bar{x}$. Returning to the case of a general population mean $m$,
we then have the following theorem first rigorously proved by R. A.
Fisher (Ref. 97):

The mean $\bar{x}$ and the variance $s^2$ of a normal sample are independent,
and $\bar{x}$ is normal $(m, \sigma/\sqrt{n})$, while $n s^2/\sigma^2$ is distributed in a $\chi^2$-distribution
with $n - 1$ degrees of freedom.

It can be shown that the independence of $\bar{x}$ and $s^2$ holds only when the parent
distribution is normal (cf Geary, Ref. 116, and Lukacs, Ref. 160). On the other hand,
we have seen in 27.4 that $\bar{x}$ and $s^2$ are uncorrelated whenever the third central moment
of the parent distribution is zero.
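
The theorem lends itself to a quick simulation check. The sketch below is a rough Monte Carlo illustration under assumed values of n, m and sigma (none of which come from the text); it estimates the mean of $n s^2/\sigma^2$, which should be close to $n - 1$, and the correlation between $\bar{x}$ and $s^2$, which should be close to zero. A vanishing correlation is of course only a necessary consequence of independence, not a proof of it.

```python
import random


def simulate(n=5, m=2.0, sigma=1.5, trials=20000, seed=0):
    """Draw normal samples and return the lists of means and of n*s^2/sigma^2."""
    rng = random.Random(seed)
    means, scaled_vars = [], []
    for _ in range(trials):
        xs = [rng.gauss(m, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        s2 = sum((x - xbar) ** 2 for x in xs) / n  # divisor n, as in the text
        means.append(xbar)
        scaled_vars.append(n * s2 / sigma ** 2)
    return means, scaled_vars


def correlation(a, b):
    """Ordinary product-moment correlation of two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


if __name__ == "__main__":
    means, q = simulate()
    print("mean of n s^2 / sigma^2:", sum(q) / len(q))
    print("correlation of mean and variance:", correlation(means, q))
```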


It follows from the theorem that the unbiased estimate (cf 27.6)
of the variance, $\frac{n}{n-1} s^2$, has the fr. f. $\frac{n-1}{\sigma^2}\, k_{n-1}\!\left(\frac{(n-1)x}{\sigma^2}\right)$. Comparing
with the fr. f. of $\frac{1}{n}\sum_1^n x_i^2$ given in the table at the end of 18.1,
it is seen that the variable $\frac{n}{n-1} s^2 = \frac{1}{n-1}\sum_1^n (x_i - \bar{x})^2$ is distributed
as the arithmetic mean of $n - 1$ squares of independent normal $(0, \sigma)$
variables, in accordance with the fact that there are $n - 1$ d. of fr.
in the distribution.




The mean and the variance of $s^2 = m_2$ have already been given in
(27.4.5). By means of (18.1.5) we obtain the following general expression
of the moments

(29.3.2) $E(s^{2\nu}) = \frac{(n-1)(n+1)(n+3) \cdots (n+2\nu-3)}{n^{\nu}}\, \sigma^{2\nu}$.

Hence we deduce the expressions for the coefficients of skewness and
excess:

$$\gamma_1(s^2) = \frac{2\sqrt{2}}{\sqrt{n-1}}, \qquad \gamma_2(s^2) = \frac{12}{n-1}.$$
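
The moment formula (29.3.2) can be cross-checked against the moments of the $\chi^2$-distribution with $n - 1$ degrees of freedom, using the standard identity $E(\chi^{2\nu}) = 2^{\nu}\,\Gamma(\tfrac{k}{2} + \nu)/\Gamma(\tfrac{k}{2})$ for $k$ degrees of freedom. The following sketch does this numerically; the function names are illustrative assumptions.

```python
import math


def moment_formula(n, nu, sigma=1.0):
    """E(s^(2 nu)) as the product (n-1)(n+1)...(n+2nu-3) sigma^(2nu) / n^nu."""
    prod = 1.0
    for j in range(nu):
        prod *= n - 1 + 2 * j
    return prod * sigma ** (2 * nu) / n ** nu


def moment_via_gamma(n, nu, sigma=1.0):
    """The same moment, from s^2 = (sigma^2 / n) * chi2 with n-1 d. of fr."""
    k = n - 1
    chi2_moment = 2.0 ** nu * math.exp(math.lgamma(k / 2 + nu) - math.lgamma(k / 2))
    return chi2_moment * (sigma ** 2 / n) ** nu


if __name__ == "__main__":
    n = 10  # assumed sample size
    m1, m2, m3 = (moment_formula(n, k) for k in (1, 2, 3))
    # skewness from the first three moments about the origin
    g1 = (m3 - 3 * m1 * m2 + 2 * m1 ** 3) / (m2 - m1 ** 2) ** 1.5
    print("skewness from moments:", g1)
    print("2*sqrt(2)/sqrt(n-1):  ", 2 * math.sqrt(2) / math.sqrt(n - 1))
```

The two printed values should agree, which ties the skewness coefficient back to the general moment expression.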


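For the s. d. $s = \sqrt{m_2}$ of the sample we obtain from the theorem,
using Stirling's formula (12.5.3)

(29.3.3) $E(s) = \sqrt{\frac{2}{n}}\, \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)}\, \sigma = \sigma + O\!\left(\frac{1}{n}\right), \qquad D^2(s) = \frac{n-1}{n}\sigma^2 - E^2(s) = \frac{\sigma^2}{2n} + O\!\left(\frac{1}{n^2}\right),$

in accordance with the general expressions (27.7.1) and (27.7.2).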

In view of the great importance of the theorem on the joint
distribution of $\bar{x}$ and $s^2$, we shall now give another proof of the same
result, using certain transformations of variables, combined with geometrical
arguments. As before, we suppose in the proof that $m = 0$.

Consider the $n$-dimensional sample space $R_n$ of the variables
$x_1, \ldots, x_n$. Our sample is represented by a variable point in this
space, the sample point $X = X(x_1, \ldots, x_n)$. Let $XR$ be the perpendicular
from $X$ to the line $x_1 = x_2 = \cdots = x_n$. Then $R$ has the coordinates
$(\bar{x}, \ldots, \bar{x})$ so that the square of the distance $OR$ from the origin $O$
to $R$ is $n\bar{x}^2$, and consequently $XR^2 = OX^2 - OR^2 = \sum_1^n x_i^2 - n\bar{x}^2 = n s^2$.

The joint distribution of the variables $x_i$ is conceived in the usual
way as a distribution of a mass unit over $R_n$, and the probability
element of this distribution is

$$dP = \frac{1}{(2\pi)^{n/2}\sigma^n}\, e^{-\frac{1}{2\sigma^2}\sum_1^n x_i^2}\, dx_1 \ldots dx_n.$$


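We now perform a rotation of the coordinate axes, such that one of
the axes is brought to coincide with the line $OR$. This rotation is
expressed by an orthogonal substitution $y_i = \sum_{j=1}^{n} c_{ij} x_j$, where one of
the $y_i$, say $y_n$, is equal to $\sqrt{n}\,\bar{x}$. We then obtain
$\sum_1^n x_i^2 = \sum_1^n y_i^2 = n\bar{x}^2 + \sum_1^{n-1} y_i^2$, and hence $\sum_1^{n-1} y_i^2 = n s^2$. The determinant
of the substitution being $\pm 1$, we have by (22.2.3)

$$dP = \frac{1}{(2\pi)^{n/2}\sigma^n}\, e^{-\frac{1}{2\sigma^2}\sum_1^n y_i^2}\, dy_1 \ldots dy_n = \frac{\sqrt{n}}{(2\pi)^{n/2}\sigma^n}\, e^{-\frac{n}{2\sigma^2}(\bar{x}^2 + s^2)}\, d\bar{x}\, dy_1 \ldots dy_{n-1}.$$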


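We further introduce the substitution

(29.3.4) $y_i = \sqrt{n}\, s\, z_i$, $(i = 1, 2, \ldots, n - 1)$,

which signifies that we take the length $XR = \sqrt{n}\, s$ as unit. However,
by the last substitution we have replaced the $n - 1$ variables $y_i$
by $n$ new variables $s$ and $z_1, \ldots, z_{n-1}$. Accordingly there is a relation
between the new variables, which is found by squaring and adding
the $n - 1$ equations (29.3.4). We then obtain

(29.3.5) $\sum_1^{n-1} z_i^2 = 1$,

and thus one of the $z_i$, say $z_{n-1}$, may be expressed as a function of
the $n - 2$ others, so that in (29.3.4) the old variables $y_1, \ldots, y_{n-1}$
are replaced by the new variables $s$ and $z_1, \ldots, z_{n-2}$. For the Jacobian
$J$ of the transformation we have, since $\frac{\partial z_{n-1}}{\partial z_i} = -\frac{z_i}{z_{n-1}}$,

$$J = \begin{vmatrix}
\sqrt{n}\, z_1 & \sqrt{n}\, s & 0 & \cdots & 0 \\
\sqrt{n}\, z_2 & 0 & \sqrt{n}\, s & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
\sqrt{n}\, z_{n-2} & 0 & 0 & \cdots & \sqrt{n}\, s \\
\sqrt{n}\, z_{n-1} & -\sqrt{n}\, s \frac{z_1}{z_{n-1}} & -\sqrt{n}\, s \frac{z_2}{z_{n-1}} & \cdots & -\sqrt{n}\, s \frac{z_{n-2}}{z_{n-1}}
\end{vmatrix}
= n^{\frac{n-1}{2}} s^{n-2}
\begin{vmatrix}
z_1 & 1 & 0 & \cdots & 0 \\
z_2 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
z_{n-2} & 0 & 0 & \cdots & 1 \\
z_{n-1} & -\frac{z_1}{z_{n-1}} & -\frac{z_2}{z_{n-1}} & \cdots & -\frac{z_{n-2}}{z_{n-1}}
\end{vmatrix}$$

$$= (-1)^n\, \frac{n^{\frac{n-1}{2}} s^{n-2}}{z_{n-1}} = \pm \frac{n^{\frac{n-1}{2}} s^{n-2}}{\sqrt{1 - z_1^2 - \cdots - z_{n-2}^2}}.$$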

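To any system of values $(y_1, \ldots, y_{n-1}) \neq (0, \ldots, 0)$ we obtain from
(29.3.4) and (29.3.5) a uniquely determined system of values of
$z_1, \ldots, z_{n-2}$ and $s$, such that $s > 0$. On the other hand, to any given
system of values of $z_1, \ldots, z_{n-2}$ and $s$, such that $\sum_1^{n-2} z_i^2 < 1$ and $s > 0$,
there correspond two values of $z_{n-1}$ with opposite signs determined
by (29.3.5), viz. $z_{n-1} = \pm\sqrt{1 - z_1^2 - \cdots - z_{n-2}^2}$, and thus two systems
of values of the $y_i$, say $y_1, \ldots, y_{n-2}, \pm y_{n-1}$. Both these systems yield
the same value of the probability element $dP$ and the modulus $|J|$
of the Jacobian, and thus we obtain by means of a remark in 22.2
the expression

$$dP = \frac{2\, n^{n/2}}{(2\pi)^{n/2}\sigma^n}\, \frac{s^{n-2}}{\sqrt{1 - z_1^2 - \cdots - z_{n-2}^2}}\, e^{-\frac{n}{2\sigma^2}(\bar{x}^2 + s^2)}\, d\bar{x}\, ds\, dz_1 \ldots dz_{n-2}$$

$$= \frac{\sqrt{n}}{\sigma\sqrt{2\pi}}\, e^{-\frac{n\bar{x}^2}{2\sigma^2}}\, d\bar{x} \cdot \frac{n^{\frac{n-1}{2}}}{2^{\frac{n-3}{2}}\, \Gamma\!\left(\frac{n-1}{2}\right) \sigma^{n-1}}\, s^{n-2} e^{-\frac{n s^2}{2\sigma^2}}\, ds \cdot \frac{\Gamma\!\left(\frac{n-1}{2}\right)}{\pi^{\frac{n-1}{2}}}\, \frac{dz_1 \ldots dz_{n-2}}{\sqrt{1 - z_1^2 - \cdots - z_{n-2}^2}}.$$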


The probability element $dP$ appears here as a product of three factors,
viz. the probability elements of $\bar{x}$ and $s$, and the joint probability
element of $z_1, \ldots, z_{n-2}$. We thus see (cf 22.1.2) that $\bar{x}$ and $s$ are
independent not only of one another, but also of the combined variable
$(z_1, \ldots, z_{n-2})$, and that the distributions of $\bar{x}$ and $s$ are those given
by the above theorem. 1)

1) The same result can be obtained by means of the transformation $x_i = \bar{x} + s z_i$,
which has been used for this and other purposes e. g. by Behrens, Steffensen, Rasch
and Hald (Ref. 00, 218, 206).




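For a later purpose we finally observe that, in the general case
when the population mean $m$ is not zero, the above transformation
of the probability element may be written

(29.3.6)
$$dP = \frac{1}{(2\pi)^{n/2}\sigma^n}\, e^{-\frac{1}{2\sigma^2}\sum_1^n (x_i - m)^2}\, dx_1 \ldots dx_n$$
$$= \frac{\sqrt{n}}{\sigma\sqrt{2\pi}}\, e^{-\frac{n(\bar{x} - m)^2}{2\sigma^2}}\, d\bar{x} \cdot \frac{n^{\frac{n-1}{2}}}{2^{\frac{n-3}{2}}\, \Gamma\!\left(\frac{n-1}{2}\right) \sigma^{n-1}}\, s^{n-2} e^{-\frac{n s^2}{2\sigma^2}}\, ds \cdot \frac{\Gamma\!\left(\frac{n-1}{2}\right)}{\pi^{\frac{n-1}{2}}}\, \frac{dz_1 \ldots dz_{n-2}}{\sqrt{1 - z_1^2 - \cdots - z_{n-2}^2}}.$$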


Consider the effect of the above transformation on the expression

$$m_\nu\, m_2^{-\nu/2} = \frac{\frac{1}{n}\sum_1^n (x_i - \bar{x})^\nu}{s^\nu}, \qquad (\nu > 2).$$

By means of the identity (29.3.1), it is easily shown that every $x_i - \bar{x}$ is transformed
into a linear combination of $y_1, \ldots, y_{n-1}$. It then follows from (29.3.4) and (29.3.5)
that $m_\nu m_2^{-\nu/2}$ is a function of $z_1, \ldots, z_{n-2}$ only. Thus the three variables $\bar{x}$, $s$ and
$m_\nu m_2^{-\nu/2}$ are independent. (Cf Geary, Ref. 118).

Following Geary, we can use this observation to obtain exact expressions (first
given by Fisher, Ref. 101) for the mean and the variance of the coefficients
$g_1 = m_3 m_2^{-3/2}$ and $g_2 = m_4 m_2^{-2} - 3$, instead of the asymptotic expressions (27.7.9).
It follows, in fact, from the independence theorem that

$$E\left[\left(m_\nu m_2^{-\nu/2}\right)^k\right] \cdot E\left(m_2^{k\nu/2}\right) = E\left(m_\nu^k\right),$$

so that the mean value of $\left(m_\nu m_2^{-\nu/2}\right)^k$ can be calculated from $E(m_\nu^k)$ and $E(m_2^{k\nu/2})$.
In this way we obtain

(29.3.7)
$$E(g_1) = 0, \qquad E(g_2) = -\frac{6}{n+1},$$
$$D^2(g_1) = \frac{6(n-2)}{(n+1)(n+3)}, \qquad D^2(g_2) = \frac{24\, n (n-2)(n-3)}{(n+1)^2 (n+3)(n+5)}.$$
Thus $g_2$ is affected with a negative bias of order $n^{-1}$, while $g_1$ is unbiased. If, instead
of $g_1$ and $g_2$, we consider the analogous quantities

(29.3.8)
$$G_1 = \frac{K_3}{K_2^{3/2}} = \frac{\sqrt{n(n-1)}}{n-2}\, g_1, \qquad G_2 = \frac{K_4}{K_2^2} = \frac{n-1}{(n-2)(n-3)}\left[(n+1)\, g_2 + 6\right],$$

where the $K_\nu$ are the unbiased semi-invariant estimates of Fisher (cf 27.8), the bias
disappears, and we obtain

(29.3.9)
$$E(G_1) = E(G_2) = 0,$$
$$D^2(G_1) = \frac{6\, n(n-1)}{(n-2)(n+1)(n+3)}, \qquad D^2(G_2) = \frac{24\, n (n-1)^2}{(n-3)(n-2)(n+3)(n+5)}.$$


29.4. Student's ratio. – Consider the variables $\sqrt{n}\,(\bar{x} - m)$ and
$\frac{n}{n-1} s^2$, when the parent distribution is normal $(m, \sigma)$. According to
the preceding paragraph, these two variables are independent, and
$\sqrt{n}\,(\bar{x} - m)$ is normal $(0, \sigma)$, while $\frac{n}{n-1} s^2$ is distributed as the arithmetic
mean of $n - 1$ squares of independent normal $(0, \sigma)$ variables.
By the definition of Student's distribution in 18.2, the ratio

(29.4.1) $t = \frac{\sqrt{n}\,(\bar{x} - m)}{\sqrt{\frac{n}{n-1} s^2}} = \sqrt{n-1}\; \frac{\bar{x} - m}{s}$

is then distributed in Student's distribution with $n - 1$ degrees of freedom.
Thus $t$ has the fr. f.

$$s_{n-1}(x) = \frac{1}{\sqrt{(n-1)\pi}}\, \frac{\Gamma\!\left(\frac{n}{2}\right)}{\Gamma\!\left(\frac{n-1}{2}\right)} \left(1 + \frac{x^2}{n-1}\right)^{-\frac{n}{2}}.$$

This can, of course, also be shown more directly. Assuming for
simplicity $m = 0$, we replace the sample variables $x_1, \ldots, x_n$ by new
variables $y_1, \ldots, y_n$ by means of an orthogonal transformation such
that $y_1 = \sqrt{n}\,\bar{x} = \frac{x_1 + \cdots + x_n}{\sqrt{n}}$. Then $n s^2 = \sum_1^n x_i^2 - n\bar{x}^2 = \sum_2^n y_i^2$ and
thus

$$t = \frac{y_1}{\sqrt{\frac{1}{n-1}\sum_2^n y_i^2}},$$

where by 29.2 the $y_i$ are independent and normal $(0, \sigma)$. We can
then directly apply the argument of (18.2.1)–(18.2.4).
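
In practice, Student's ratio (29.4.1) is computed directly from the sample values once a hypothetical mean $m$ is specified. The following is a minimal sketch; the function name `student_ratio` and the data are assumptions for illustration. Note that the sample variance uses the divisor $n$, as in the text, so that the factor in front is $\sqrt{n-1}$.

```python
import math


def student_ratio(xs, m):
    """Return Student's ratio t = sqrt(n-1) (mean - m) / s and its d. of fr."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / n  # sample variance, divisor n
    t = math.sqrt(n - 1) * (xbar - m) / math.sqrt(s2)
    return t, n - 1


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0, 4.0, 5.0]  # assumed data
    t, dof = student_ratio(sample, m=2.0)
    print(f"t = {t:.6f} with {dof} degrees of freedom")
```

The observed value of $t$ would then be compared with a table of Student's distribution for the stated number of degrees of freedom.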

If, in the first expression of $t$ in (29.4.1), we replace $\frac{n}{n-1} s^2$ by its mean $\sigma^2$, we
obtain the variable $\sqrt{n}\, \frac{\bar{x} - m}{\sigma}$, which is obviously normal (0, 1). It follows from 20.6
that the difference $t - \sqrt{n}\, \frac{\bar{x} - m}{\sigma}$ converges in probability to zero as $n \to \infty$. Accordingly
by (20.2.2) the fr. f. of $t$ tends to $\frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$ as $n \to \infty$.


The variable $t$ defined by (29.4.1) is known as Student's ratio. 1)
Its distribution was first discovered by Student (Ref. 221), whose
results were then rigorously proved by R. A. Fisher (Ref. 97).

1) Student actually considered the ratio $z = t/\sqrt{n-1} = (\bar{x} - m)/s$.

As already pointed out in 18.2, the fr. f. $s_{n-1}$, as well as the variable
$t$ itself, does not contain $\sigma$. As soon as we know $m$, we may
thus calculate $t$ from the sample values, and compare the observed
value of $t$ with the theoretical distribution. In this way we obtain a
practically important test of significance for the deviation of the sample
mean $\bar{x}$ from some hypothetical value of the population mean $m$ (cf 31.2
and 31.3, Ex. 4).

Of even greater practical importance is the application of Student's
distribution to test the significance of the difference between two mean
values (R. A. Fisher, Ref. 97; cf 31.2). The sampling distribution
relevant to this problem is obtained as follows.

Suppose that we have two independent samples $x_1, \ldots, x_{n_1}$ and
$y_1, \ldots, y_{n_2}$, drawn from the same normal population. Without loss of
generality, we may assume $m = 0$. Let the mean and the variance of
the first sample be denoted by $\bar{x} = \frac{1}{n_1}\sum x_i$ and $s_1^2 = \frac{1}{n_1}\sum (x_i - \bar{x})^2$,
while $\bar{y}$ and $s_2^2$ are the corresponding characteristics of the second
sample. We now replace all the $n_1 + n_2$ variables $x_1, \ldots, x_{n_1}, y_1, \ldots, y_{n_2}$
by new variables $z_1, \ldots, z_{n_1+n_2}$ by means of an orthogonal transformation
such that $z_1 = \sqrt{n_1}\,\bar{x}$ and $z_2 = \sqrt{n_2}\,\bar{y}$. The quadratic form

$$Q = n_1 s_1^2 + n_2 s_2^2 = \sum_1^{n_1} x_i^2 + \sum_1^{n_2} y_i^2 - n_1\bar{x}^2 - n_2\bar{y}^2$$

is then transformed into $Q = \sum_3^{n_1+n_2} z_i^2$, which shows that the rank, or
the number of d. of fr., of $Q$ is $n_1 + n_2 - 2$. If we define a random
variable $u$ by the relation

(29.4.2) $u = \sqrt{\frac{n_1 n_2 (n_1 + n_2 - 2)}{n_1 + n_2}}\; \frac{\bar{x} - \bar{y}}{\sqrt{Q}}$,

$u$ is then transformed into

$$u = \frac{w}{\sqrt{\frac{1}{n_1 + n_2 - 2}\sum_3^{n_1+n_2} z_i^2}}, \qquad w = \frac{\sqrt{n_2}\, z_1 - \sqrt{n_1}\, z_2}{\sqrt{n_1 + n_2}},$$

where $w$ and $z_3, \ldots, z_{n_1+n_2}$ are independent and normal $(0, \sigma)$. We can
now once more apply the argument of 18.2, and it follows that the
variable $u$ is distributed in Student's distribution with $n_1 + n_2 - 2$ d. of
fr., so that $u$ has the fr. f. $s_{n_1+n_2-2}(x)$. This result evidently holds true
irrespective of the value of $m$. It will be observed that in this
case neither the variable $u$ nor the corresponding fr. f. contains any
of the parameters $m$ and $\sigma$ of the parent distribution. Thus we can
calculate $u$ directly from the sample values, and compare the observed
value of $u$ with the theoretical distribution (cf 31.2 and 31.3, Ex. 4).
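
The statistic $u$ of (29.4.2) depends on the two samples only, so it can be computed without any knowledge of $m$ or $\sigma$. The sketch below is an illustration with assumed data and an assumed function name `two_sample_u`; it also prints the familiar pooled-variance form of the same quantity for comparison.

```python
import math


def two_sample_u(xs, ys):
    """Return u of (29.4.2) and its number of degrees of freedom n1 + n2 - 2."""
    n1, n2 = len(xs), len(ys)
    xbar, ybar = sum(xs) / n1, sum(ys) / n2
    # Q = n1 s1^2 + n2 s2^2, i.e. the pooled sum of squared deviations
    q = sum((x - xbar) ** 2 for x in xs) + sum((y - ybar) ** 2 for y in ys)
    factor = math.sqrt(n1 * n2 * (n1 + n2 - 2) / (n1 + n2))
    return factor * (xbar - ybar) / math.sqrt(q), n1 + n2 - 2


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0]       # assumed first sample
    ys = [2.0, 4.0, 6.0, 8.0]  # assumed second sample
    u, dof = two_sample_u(xs, ys)
    # Equivalent pooled-variance form: (xbar - ybar) / sqrt(sp^2 (1/n1 + 1/n2))
    sp2 = 22.0 / dof  # Q = 2 + 20 = 22 for these data
    pooled = (2.0 - 5.0) / math.sqrt(sp2 * (1 / 3 + 1 / 4))
    print(f"u = {u:.6f}, pooled form = {pooled:.6f}, d. of fr. = {dof}")
```

Adding a constant to all observations, or multiplying them by a positive constant, leaves $u$ unchanged, in agreement with the remark that neither $m$ nor $\sigma$ enters.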

Consider the quadratic form $n s^2 = \sum_1^n (x_i - \bar{x})^2 = \sum_1^n x_i^2 - n\bar{x}^2$ in the $n$ sample
variables $x_1, \ldots, x_n$, assuming that the population mean $m$ is zero. Replacing the $x_i$
by new variables $y_i$ by means of an orthogonal transformation such that the two
first variables are

$$y_1 = \sqrt{n}\,\bar{x}, \qquad y_2 = \sqrt{\frac{n}{n-1}}\,(x_1 - \bar{x}),$$

the form $n s^2$ is transformed into $\sum_2^n y_i^2$. Consequently the variable

(29.4.3) $\tau = \frac{x_1 - \bar{x}}{s}$,

which expresses the deviation of the sample value $x_1$ from the sample mean $\bar{x}$,
measured in units of the s. d. $s$ of the sample, becomes

$$\tau = \sqrt{n-1}\; \frac{y_2}{\sqrt{\sum_2^n y_i^2}}.$$

Now $y_2, \ldots, y_n$ are independent and normal $(0, \sigma)$, and thus by (18.2.6) and (18.2.7)
the variable $\tau$ has the fr. f. (cf Thompson, Ref. 226, and Arley, Ref. 63)

(29.4.4) $\frac{1}{\sqrt{(n-1)\pi}}\, \frac{\Gamma\!\left(\frac{n-1}{2}\right)}{\Gamma\!\left(\frac{n-2}{2}\right)} \left(1 - \frac{x^2}{n-1}\right)^{\frac{n-4}{2}}, \qquad (|x| < \sqrt{n-1}).$

The variable $t = \sqrt{n-2}\; \frac{\tau}{\sqrt{n - 1 - \tau^2}}$ is then, by 18.2, distributed in Student's distribution with
$n - 2$ d. of fr. It follows from the definition of $\tau$ that these results hold irrespective
of the value of $m$. Any relative deviation $\frac{x_i - \bar{x}}{s}$ has, of course, the same
distribution as $\tau$. These results are of importance in connection with the question
of criteria for the rejection of outlying observations.

More generally, if we consider the arithmetic mean $\bar{x}_k = \frac{x_1 + \cdots + x_k}{k}$, where
$k < n$, and write $\tau_k = \frac{\bar{x}_k - \bar{x}}{s}$, the variable $\sqrt{\frac{k(n-1)}{n-k}}\; \tau_k$ has the fr. f. (29.4.4),
and consequently the variable

(29.4.5) $t = \frac{\tau_k \sqrt{k(n-2)}}{\sqrt{n - k - k\tau_k^2}}$

has Student's distribution with $n - 2$ d. of fr. (Thompson, Ref. 226). This may be
used for testing the significance of the difference between the mean of a sub-group
and a general mean (cf 31.3, Ex. 6).


29.5. A lemma. – We now proceed to the study of sampling
distributions connected with a multi-dimensional normal parent distribution.
In this preliminary paragraph, we shall prove certain results
due to Wishart and Bartlett (Ref. 240, 241) that will be required in
the sequel. Let

$$\mathbf{A} = \begin{pmatrix} a_{11} & \cdots & a_{1k} \\ \vdots & & \vdots \\ a_{k1} & \cdots & a_{kk} \end{pmatrix},$$

where $a_{ji} = a_{ij}$, be a definite positive matrix (cf 11.10) with constant elements, while

$$\mathbf{X} = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ \vdots & & \vdots \\ x_{k1} & \cdots & x_{kk} \end{pmatrix},$$

where $x_{ji} = x_{ij}$, is a variable matrix. Owing to the symmetry $\mathbf{X}$ contains,
of course, only $\frac{1}{2}k(k+1)$ distinct variables $x_{ij}$. The determinants
of the matrices are denoted by $A = |a_{ij}|$ and $X = |x_{ij}|$.

Consider now the $\frac{1}{2}k(k+1)$-dimensional space $R_{\frac{1}{2}k(k+1)}$ of the
variables $x_{ij}$, where $i \le j$. Let $S$ denote the set of all points of this
space such that the corresponding matrix $\mathbf{X}$ is definite positive, while
$S^*$ is the complementary set. For any $n > k$, we now define a function
$f_n$ of the variables $x_{ij}$ by writing


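(29.5.1) $f_n(x_{11}, \ldots, x_{kk}) = \begin{cases} C_{kn}\, A^{\frac{n-1}{2}}\, X^{\frac{n-k-2}{2}}\, e^{-\sum_{i,j} a_{ij} x_{ij}} & \text{in } S, \\ 0 & \text{in } S^*, \end{cases}$

where $C_{kn}$ is a constant depending on $k$ and $n$, but not on the $a_{ij}$
or the $x_{ij}$. The sum is extended over $i = 1, \ldots, k$ and $j = 1, \ldots, k$.
We shall now show that the constant $C_{kn}$ may be so determined that
$f_n(x_{11}, \ldots, x_{kk})$ is the fr. f. of a distribution in $R_{\frac{1}{2}k(k+1)}$. The complete
expression of $C_{kn}$ is, in fact,

(29.5.2) $C_{kn} = \frac{1}{\pi^{\frac{k(k-1)}{4}}\; \Gamma\!\left(\frac{n-1}{2}\right) \Gamma\!\left(\frac{n-2}{2}\right) \cdots \Gamma\!\left(\frac{n-k}{2}\right)}$.

For $k = 1$, (29.5.1)–(29.5.2) reduce to $f_n(x) = \frac{1}{\Gamma\left(\frac{n-1}{2}\right)}\, a^{\frac{n-1}{2}} x^{\frac{n-3}{2}} e^{-ax}$,
$(x > 0,\ a > 0)$, which is evidently a fr. f. in $R_1$.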

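For $k > 1$, we have to show that $C_{kn}$ may be determined such
that the integral of $f_n$ over the whole space $R_{\frac{1}{2}k(k+1)}$ is equal to 1.
We shall first consider the particular case when $\mathbf{A}$ is a diagonal
matrix (cf 11.1), so that $a_{ij} = 0$ for $i \neq j$. Since $\mathbf{A}$ is definite positive,
we then have $a_{ii} > 0$ for $i = 1, \ldots, k$. In any point of the
set $S$, we have $x_{ii} > 0$ for $i = 1, \ldots, k$. Introducing, for every $x_{ij}$
with $i \neq j$, the substitution

(29.5.3) $x_{ij} = y_{ij} \sqrt{x_{ii}\, x_{jj}}$,

we have $y_{ji} = y_{ij}$ and $\mathbf{X} = \mathbf{D}\mathbf{Y}\mathbf{D}$, where $\mathbf{D}$ denotes the diagonal
matrix with the elements $\sqrt{x_{11}}, \sqrt{x_{22}}, \ldots, \sqrt{x_{kk}}$, while

$$\mathbf{Y} = \begin{pmatrix} 1 & y_{12} & \cdots & y_{1k} \\ y_{21} & 1 & \cdots & y_{2k} \\ \vdots & \vdots & & \vdots \\ y_{k1} & y_{k2} & \cdots & 1 \end{pmatrix}.$$

Denoting by $Y$ the determinant of $\mathbf{Y}$ we thus have $X = x_{11} x_{22} \cdots x_{kk}\, Y$.
When $\mathbf{X}$ is definite positive, so is $\mathbf{Y}$, and conversely. The Jacobian of
the transformation (29.5.3) being $(x_{11} x_{22} \cdots x_{kk})^{\frac{k-1}{2}}$, we thus have

$$\int_S X^{\frac{n-k-2}{2}} e^{-\sum_i a_{ii} x_{ii}}\, dx_{11}\, dx_{12} \ldots dx_{kk} = \int_0^\infty \!\!\cdots\! \int_0^\infty (x_{11} \cdots x_{kk})^{\frac{n-3}{2}} e^{-\sum_i a_{ii} x_{ii}}\, dx_{11}\, dx_{22} \ldots dx_{kk} \int_{S'} Y^{\frac{n-k-2}{2}}\, dy_{12} \ldots dy_{k-1,k},$$

the integral with respect to the $y_{ij}$ being extended over the set $S'$
of all $y_{ij}$ such that $\mathbf{Y}$ is definite positive. Obviously the integral
with respect to the $y_{ij}$, say $J_k$, depends only on $k$ and $n$, so that the
whole integral reduces to

$$\frac{J_k\, \Gamma^k\!\left(\frac{n-1}{2}\right)}{(a_{11} a_{22} \cdots a_{kk})^{\frac{n-1}{2}}} = \frac{H_{kn}}{A^{\frac{n-1}{2}}},$$

where $H_{kn}$ depends only on $k$ and $n$. Taking in (29.5.1) $C_{kn} = 1/H_{kn}$,
it follows that the integral of $f_n(x_{11}, \ldots, x_{kk})$ over the whole space
$R_{\frac{1}{2}k(k+1)}$ is equal to 1, so that $f_n$ (being obviously non-negative) is the
fr. f. of a distribution in $R_{\frac{1}{2}k(k+1)}$.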

In order to complete the proof in the case when $a_{ij} = 0$ for $i \neq j$, it remains to
verify the expression (29.5.2) for $C_{kn}$. It follows from the above that we have to prove

$$J_k = \int_{S'} Y^{\frac{n-k-2}{2}}\, dy_{12} \ldots dy_{k-1,k} = \pi^{\frac{k(k-1)}{4}} \prod_{i=1}^{k} \frac{\Gamma\!\left(\frac{n-i}{2}\right)}{\Gamma\!\left(\frac{n-1}{2}\right)}$$

for $2 \le k < n$. This may be proved by induction, and we shall indicate the general
lines of the proof. For $k = 2$, our relation reduces to

$$J_2 = \int_{-1}^{1} (1 - y^2)^{\frac{n-4}{2}}\, dy = \sqrt{\pi}\; \frac{\Gamma\!\left(\frac{n-2}{2}\right)}{\Gamma\!\left(\frac{n-1}{2}\right)},$$

which may be directly verified, since the substitution $y^2 = z$ changes the integral
into a Beta-function (cf. 12.4). Suppose now that our relation has been proved for a
certain value of $k$, and consider $J_{k+1}$. Expanding the determinant under the integral
according to (11.6.3), we obtain for $J_{k+1}$ the expression

$$\int_{S'} dy_{12} \ldots dy_{k-1,k} \int \Big(Y - \sum_{i,j=1}^{k} Y_{ij}\, y_{i,k+1}\, y_{j,k+1}\Big)^{\frac{n-k-3}{2}} dy_{1,k+1} \ldots dy_{k,k+1},$$

where the integral with respect to the $y_{i,k+1}$ has to be extended over all values of
the variables such that $\sum_{i,j=1}^{k} Y_{ij}\, y_{i,k+1}\, y_{j,k+1} < Y$. The latter integral may be evaluated
by the same methods as the integrals (11.12.3)–(11.12.4), and we obtain

$$J_{k+1} = \pi^{\frac{k}{2}}\; \frac{\Gamma\!\left(\frac{n-k-1}{2}\right)}{\Gamma\!\left(\frac{n-1}{2}\right)}\, J_k.$$

Thus the relation holds for $k + 1$, and the proof is completed.


In the general case when $\mathbf{A}$ is any definite positive matrix, we
consider the transformation

(29.5.4) $\mathbf{C}'\mathbf{A}\mathbf{C} = \mathbf{B}, \qquad \mathbf{C}'\mathbf{X}\mathbf{C} = \mathbf{Y},$

where $\mathbf{C}$ is an orthogonal matrix such that $\mathbf{B}$ is a diagonal matrix
(cf 11.9). The set $S$ in the $x$-space is transformed into the analogous
set $S_1$ in the $y$-space. From the proof given above, it then follows
that the function

(29.5.5) $g_n(y_{11}, \ldots, y_{kk}) = \begin{cases} C_{kn}\, B^{\frac{n-1}{2}}\, Y^{\frac{n-k-2}{2}}\, e^{-\sum_i b_{ii} y_{ii}} & \text{in } S_1, \\ 0 & \text{in } S_1^*, \end{cases}$

is a fr. f. in the $y$-space. (Note that we have $b_{ij} = 0$ for $i \neq j$.) Now,
since the determinant of $\mathbf{C}$ is equal to $\pm 1$, we have $A = B$ and
$X = Y$, and it is further verified by direct substitution that we have
$\sum_{i,j} a_{ij} x_{ij} = \sum_i b_{ii} y_{ii}$. Thus if, in the distribution (29.5.5), we introduce
the transformation of random variables defined by (29.5.4), we obtain
according to 22.2 a transformed distribution with the fr. f. $f_n(x_{11}, \ldots, x_{kk})$.
Thus $f_n$ is a fr. f., and our assertion is proved.


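In the particular case $k = 2$, there are three variables $x_{11}$, $x_{22}$ and $x_{12} = x_{21}$.
The set $S$ is the domain defined by the inequalities $x_{11} > 0$, $x_{22} > 0$, $x_{12}^2 < x_{11} x_{22}$.
In $S$ we have

(29.5.6) $f_n(x_{11}, x_{12}, x_{22}) = C_{2n}\, (a_{11} a_{22} - a_{12}^2)^{\frac{n-1}{2}}\, (x_{11} x_{22} - x_{12}^2)^{\frac{n-4}{2}}\, e^{-a_{11} x_{11} - 2 a_{12} x_{12} - a_{22} x_{22}},$

where (cf 12.4.4)

$$C_{2n} = \frac{1}{\sqrt{\pi}\; \Gamma\!\left(\frac{n-1}{2}\right) \Gamma\!\left(\frac{n-2}{2}\right)} = \frac{2^{n-3}}{\pi\, \Gamma(n-2)}.$$

Outside $S$ the fr. f. is zero.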

We shall also consider the c. f. $\varphi_n(t_{11}, \ldots, t_{kk})$ corresponding to the
fr. f. $f_n(x_{11}, \ldots, x_{kk})$ defined by (29.5.1). Let $\mathbf{T} = \{t_{ij}\}$ denote the symmetric
matrix of the variables $t_{ij}$, and put

$$\varepsilon_{ij} = \begin{cases} 1 & \text{for } i = j, \\ \frac{1}{2} & \text{for } i \neq j. \end{cases}$$

Since $f_n = 0$ in $S^*$, the c. f. corresponding to the fr. f. $f_n$ is

$$\varphi_n(t_{11}, \ldots, t_{kk}) = \int_S e^{\,\mathbf{i}\sum_{i,j} \varepsilon_{ij} t_{ij} x_{ij}}\, f_n(x_{11}, \ldots, x_{kk})\, dx_{11}\, dx_{12} \ldots dx_{kk}.$$

(In order to avoid confusion, we use here a heavy-faced $\mathbf{i}$ to denote
the imaginary unit, as already mentioned in 27.1.) For $t_{ij} = 0$, the
integral is equal to 1, so that we have

$$\int_S X^{\frac{n-k-2}{2}}\, e^{-\sum_{i,j} a_{ij} x_{ij}}\, dx_{11}\, dx_{12} \ldots dx_{kk} = \frac{1}{C_{kn}\, A^{\frac{n-1}{2}}}.$$

Replacing here $a_{ij}$ by $a_{ij} - \mathbf{i}\,\varepsilon_{ij} t_{ij}$, and denoting by $A^*$ the determinant
$A^* = |a_{ij} - \mathbf{i}\,\varepsilon_{ij} t_{ij}|$, we obtain finally the expression

(29.5.7) $\varphi_n(t_{11}, \ldots, t_{kk}) = \left(\frac{A}{A^*}\right)^{\frac{n-1}{2}}$

for the c. f. corresponding to the distribution (29.5.1). 1)

1) Ingham (Ref. 180) has shown directly that the c. f. (29.5.7) gives, according to
the inversion formula (10.0.9), the fr. f. (29.5.1).

29.6. Sampling from a two-dimensional normal distribution. –
In a basic paper of 1915, R. A. Fisher (Ref. 88) gave exact expressions
for certain sampling distributions connected with a two-dimensional
normal parent distribution. We shall now prove some of Fisher's
results, using the method of characteristic functions first applied to
these problems by Romanovsky (Ref. 208, 209). It will be found that
the distributions obtained are particular cases of the distributions
considered in the preceding paragraph.

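Consider a non-singular normal distribution in two variables (cf
21.12). Without loss of generality, we may assume the first order
moments equal to zero, so that the fr. f. is in the usual notation

$$\frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\; e^{-\frac{1}{2(1-\rho^2)}\left(\frac{x^2}{\sigma_1^2} - \frac{2\rho x y}{\sigma_1\sigma_2} + \frac{y^2}{\sigma_2^2}\right)} = \frac{1}{2\pi\sqrt{M}}\; e^{-\frac{1}{2M}\left(\mu_{02} x^2 - 2\mu_{11} x y + \mu_{20} y^2\right)},$$

where $M = \mu_{20}\mu_{02} - \mu_{11}^2 = \sigma_1^2\sigma_2^2(1 - \rho^2)$ is the determinant of the moment
matrix $\mathbf{M} = \begin{pmatrix} \mu_{20} & \mu_{11} \\ \mu_{11} & \mu_{02} \end{pmatrix}$. From a sample of $n$ observed pairs of
values $(x_1, y_1), \ldots, (x_n, y_n)$, we calculate the moment characteristics of
the first and second orders (cf 27.1.6)

(29.6.1)
$$\bar{x} = \frac{1}{n}\sum_i x_i, \qquad \bar{y} = \frac{1}{n}\sum_i y_i,$$
$$m_{20} = s_1^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2 = \frac{1}{n}\sum_i x_i^2 - \bar{x}^2,$$
$$m_{11} = r s_1 s_2 = \frac{1}{n}\sum_i (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n}\sum_i x_i y_i - \bar{x}\bar{y},$$
$$m_{02} = s_2^2 = \frac{1}{n}\sum_i (y_i - \bar{y})^2 = \frac{1}{n}\sum_i y_i^2 - \bar{y}^2.$$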

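We now propose to find the joint distribution of the five random
variables $\bar{x}$, $\bar{y}$, $m_{20}$, $m_{11}$ and $m_{02}$. The c. f. of this distribution is a
function of five variables $t_1$, $t_2$, $t_{20}$, $t_{11}$ and $t_{02}$, viz.

(29.6.2) $E\left(e^{\,\mathbf{i}(t_1\bar{x} + t_2\bar{y} + t_{20} m_{20} + t_{11} m_{11} + t_{02} m_{02})}\right) = \frac{1}{(2\pi)^n M^{n/2}} \int e^{\Omega}\, dx_1 \ldots dx_n\, dy_1 \ldots dy_n,$

where

$$\Omega = \mathbf{i}\,(t_1\bar{x} + t_2\bar{y} + t_{20} m_{20} + t_{11} m_{11} + t_{02} m_{02}) - \frac{1}{2M}\sum_{i=1}^{n}\left(\mu_{02} x_i^2 - 2\mu_{11} x_i y_i + \mu_{20} y_i^2\right),$$

and the integral is extended over the $2n$-dimensional space of the
variables $x_1, \ldots, x_n, y_1, \ldots, y_n$.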

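We now replace $x_1, \ldots, x_n$ by new variables $\xi_1, \ldots, \xi_n$ by means
of an orthogonal transformation such that $\xi_1 = \sqrt{n}\,\bar{x}$, and apply a
transformation with the same matrix to $y_1, \ldots, y_n$, which are thus
replaced by new variables $\eta_1, \ldots, \eta_n$ such that $\eta_1 = \sqrt{n}\,\bar{y}$. We then have

$$\sum_1^n x_i^2 = \sum_1^n \xi_i^2, \qquad \sum_1^n x_i y_i = \sum_1^n \xi_i\eta_i, \qquad \sum_1^n y_i^2 = \sum_1^n \eta_i^2,$$
$$n\, m_{20} = \sum_2^n \xi_i^2, \qquad n\, m_{11} = \sum_2^n \xi_i\eta_i, \qquad n\, m_{02} = \sum_2^n \eta_i^2,$$

and hence

$$\Omega = \frac{\mathbf{i}\,(t_1\xi_1 + t_2\eta_1)}{\sqrt{n}} - \frac{1}{2M}\left(\mu_{02}\xi_1^2 - 2\mu_{11}\xi_1\eta_1 + \mu_{20}\eta_1^2\right) - \sum_{i=2}^{n}\left[\left(\frac{\mu_{02}}{2M} - \frac{\mathbf{i}\, t_{20}}{n}\right)\xi_i^2 - 2\left(\frac{\mu_{11}}{2M} + \frac{\mathbf{i}\, t_{11}}{2n}\right)\xi_i\eta_i + \left(\frac{\mu_{20}}{2M} - \frac{\mathbf{i}\, t_{02}}{n}\right)\eta_i^2\right].$$

Introducing this expression of $\Omega$ in (29.6.2), the transformed $2n$-fold
integral reduces to a product of $n$ double integrals, which may be
directly evaluated by means of (11.12.1) and (11.12.2). The joint c. f.
(29.6.2) then takes the form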


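(29.6.3) $e^{-\frac{1}{2n}\left(\mu_{20} t_1^2 + 2\mu_{11} t_1 t_2 + \mu_{02} t_2^2\right)} \cdot \left(\frac{A}{A^*}\right)^{\frac{n-1}{2}},$

where

$$A = \begin{vmatrix} \dfrac{n\mu_{02}}{2M} & -\dfrac{n\mu_{11}}{2M} \\[2mm] -\dfrac{n\mu_{11}}{2M} & \dfrac{n\mu_{20}}{2M} \end{vmatrix} = \frac{n^2}{4M}, \qquad
A^* = \begin{vmatrix} \dfrac{n\mu_{02}}{2M} - \mathbf{i}\, t_{20} & -\dfrac{n\mu_{11}}{2M} - \dfrac{\mathbf{i}\, t_{11}}{2} \\[2mm] -\dfrac{n\mu_{11}}{2M} - \dfrac{\mathbf{i}\, t_{11}}{2} & \dfrac{n\mu_{20}}{2M} - \mathbf{i}\, t_{02} \end{vmatrix}.$$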




The joint c. f. (29.6.3) is a product of two factors, the first of which
contains only the variables $t_1$ and $t_2$, while the second factor contains
only $t_{20}$, $t_{11}$ and $t_{02}$. The first factor is, by (21.12.2), the c. f. of a
normal distribution with zero mean values 1) and the moment matrix
$n^{-1}\mathbf{M}$. The second factor, on the other hand, is a particular case of
the c. f. (29.5.7). In fact, if we take in the preceding paragraph $k = 2$ and

$$a_{11} = \frac{n\mu_{02}}{2M}, \qquad a_{12} = -\frac{n\mu_{11}}{2M}, \qquad a_{22} = \frac{n\mu_{20}}{2M},$$

the c. f. (29.5.7) reduces to the second factor of (29.6.3). The corresponding
distribution is then the particular case $k = 2$ of (29.5.1),
(which has already been given in 29.5.6), with the variables $x_{11}$, $x_{12}$
and $x_{22}$ replaced by $m_{20}$, $m_{11}$ and $m_{02}$ respectively. Thus by 22.4 we
have the following theorem:

1) If, more generally, we consider a parent distribution with arbitrary mean values,
we obviously obtain here the same means as for the parent distribution.

The combined random variables $(\bar{x}, \bar{y})$ and $(m_{20}, m_{11}, m_{02})$ are independent.
The joint distribution of $\bar{x}$ and $\bar{y}$ is normal, with the same first
order moments as the parent distribution, and the moment matrix $n^{-1}\mathbf{M}$.
The joint distribution of $m_{20}$, $m_{11}$ and $m_{02}$ has the fr. f. $f_n$ given by

(29.6.4) $f_n(m_{20}, m_{11}, m_{02}) = \frac{n^{n-1}}{4\pi\,\Gamma(n-2)}\; \frac{(m_{20} m_{02} - m_{11}^2)^{\frac{n-4}{2}}}{M^{\frac{n-1}{2}}}\; e^{-\frac{n}{2M}\left(\mu_{02} m_{20} - 2\mu_{11} m_{11} + \mu_{20} m_{02}\right)}$

in the domain $m_{20} > 0$, $m_{02} > 0$, $m_{11}^2 < m_{20} m_{02}$, while $f_n = 0$ outside this
domain.

The mean values and the moment matrix of the five sample moments may be
calculated from the c. f. (29.6.3). We find, e. g., $E(m_{20}) = \frac{n-1}{n}\mu_{20}$, $E(m_{11}) = \frac{n-1}{n}\mu_{11}$,
$E(m_{02}) = \frac{n-1}{n}\mu_{02}$, in accordance with 27.4 and 27.8.

29.7. The correlation coefficient. – In the joint distribution (29.6.4)
of the variables $m_{20}$, $m_{11}$ and $m_{02}$, we now introduce the new variable
$r$ by the substitution $m_{11} = r\sqrt{m_{20} m_{02}}$, so that $r$ is the correlation
coefficient of the sample. By (22.2.3), we then obtain the following
expression for the joint fr. f. of $m_{20}$, $m_{02}$ and $r$:


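$$\sqrt{m_{20} m_{02}}\; f_n\!\left(m_{20},\, r\sqrt{m_{20} m_{02}},\, m_{02}\right) = \frac{n^{n-1}}{4\pi\,\Gamma(n-2)\, M^{\frac{n-1}{2}}}\; m_{20}^{\frac{n-3}{2}}\, m_{02}^{\frac{n-3}{2}}\, (1 - r^2)^{\frac{n-4}{2}}\; e^{-\frac{n}{2M}\left(\mu_{02} m_{20} - 2\mu_{11} r\sqrt{m_{20} m_{02}} + \mu_{20} m_{02}\right)},$$

where $m_{20} > 0$, $m_{02} > 0$, $r^2 < 1$. The marginal fr. f. of $r$ is now obtained
by integrating the joint fr. f. with respect to $m_{20}$ and $m_{02}$ from
0 to $+\infty$. If the factor $e^{\frac{n\mu_{11}}{M} r\sqrt{m_{20} m_{02}}}$ is developed in power series,
the integration can be explicitly performed, and we thus obtain the
fr. f. of the sample correlation coefficient $r$:

(29.7.1) $f_n(r) = \frac{2^{n-3}}{\pi\,(n-3)!}\; (1 - \rho^2)^{\frac{n-1}{2}}\, (1 - r^2)^{\frac{n-4}{2}} \sum_{\nu=0}^{\infty} \Gamma^2\!\left(\frac{n+\nu-1}{2}\right) \frac{(2\rho r)^\nu}{\nu!}$

for $-1 < r < 1$. The power series appearing in this expression may
be transformed in various ways. We find, e. g., by simple calculations
the expansion

$$\int_0^1 \frac{x^{n-2}}{(1 - \rho r x)^{n-1}\sqrt{1 - x^2}}\, dx = \frac{2^{n-3}}{(n-2)!} \sum_{\nu=0}^{\infty} \Gamma^2\!\left(\frac{n+\nu-1}{2}\right) \frac{(2\rho r)^\nu}{\nu!},$$

and hence obtain the following expression for the fr. f. of $r$:

(29.7.2) $f_n(r) = \frac{n-2}{\pi}\; (1 - \rho^2)^{\frac{n-1}{2}}\, (1 - r^2)^{\frac{n-4}{2}} \int_0^1 \frac{x^{n-2}}{(1 - \rho r x)^{n-1}\sqrt{1 - x^2}}\, dx.$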


The distribution of $r$ was discovered by R. A. Fisher (Ref. 88).
We observe the remarkable property that the distribution of $r$ only
depends on the size $n$ of the sample and on the correlation coefficient
$\rho$ of the population.
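
The series form (29.7.1) of the fr. f. of $r$ is straightforward to evaluate numerically. The sketch below sums a truncated series (the truncation point `terms` and all function names are assumptions made for this illustration) and compares the case $\rho = 0$ with the closed form (29.7.5). Since consecutive terms eventually decrease roughly like powers of $|\rho r|$, a few hundred terms are ample unless $|\rho|$ is very close to 1.

```python
import math


def r_density(r, n, rho, terms=200):
    """Fr. f. of the sample correlation coefficient via the series (29.7.1).

    Assumes n >= 3, |r| < 1 and |rho| < 1; `terms` is a truncation point.
    """
    log_const = ((n - 3) * math.log(2.0) - math.log(math.pi) - math.lgamma(n - 2)
                 + 0.5 * (n - 1) * math.log(1.0 - rho * rho)
                 + 0.5 * (n - 4) * math.log(1.0 - r * r))
    x = 2.0 * rho * r
    if x == 0.0:
        # only the term nu = 0 survives
        series = math.exp(2.0 * math.lgamma((n - 1) / 2))
    else:
        series = 0.0
        for nu in range(terms):
            log_mag = (2.0 * math.lgamma((n + nu - 1) / 2)
                       + nu * math.log(abs(x)) - math.lgamma(nu + 1))
            term = math.exp(log_mag)
            # odd powers of a negative x enter with a minus sign
            series += term if (x > 0 or nu % 2 == 0) else -term
    return math.exp(log_const) * series


def r_density_rho0(r, n):
    """Closed form (29.7.5), valid when rho = 0."""
    log_c = (math.lgamma((n - 1) / 2) - math.lgamma((n - 2) / 2)
             - 0.5 * math.log(math.pi))
    return math.exp(log_c) * (1.0 - r * r) ** ((n - 4) / 2)


if __name__ == "__main__":
    n, rho, steps = 10, 0.5, 1000  # assumed example values
    h = 2.0 / steps
    # midpoint rule over (-1, 1); the result should be close to 1
    total = sum(r_density(-1.0 + (i + 0.5) * h, n, rho) for i in range(steps)) * h
    print("integral of f_n(r) over (-1, 1):", total)
```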

For $n = 2$, the fr. f. $f_n(r)$ reduces to zero, in accordance with the
fact that a correlation coefficient calculated from a sample of only
two observations is necessarily equal to $\pm 1$, so that in this case the
distribution belongs to the discrete type. For $n = 3$ the frequency
curve is U-shaped, with infinite ordinates in the points $r = \pm 1$. For
$n = 4$ we have a rectangular distribution if $\rho = 0$, and otherwise a
J-shaped distribution. For $n > 4$, the distribution is unimodal, with
the mode situated in the point $r = 0$ if $\rho = 0$, and otherwise near
the point $r = \rho$. Some examples are shown in Figs 29 a-b.

Fig. 29 a. Frequency curves for the correlation coefficient r in samples from a
normal population. n = 10.

The distribution of $r$ has been studied in detail by several authors
(cf e. g. Soper and others, Ref. 216, and Romanovsky, Ref. 208), and
extensive tables have been published by David (Ref. 261). Various
exact and approximate formulae for the characteristics of the distribution
are known. Any moment of $r$ can, of course, be directly calculated
from (29.7.1), but we shall here content ourselves with the
asymptotic formulae for $E(r)$ and $D^2(r)$ for large $n$ that have already
been given in (27.8.1) and (27.8.2).

Fig. 29 b. Frequency curves for the correlation coefficient r in samples from a
normal population, n = 50.

For practical purposes, it is often preferable to use the
transformation

(29.7.3)    $$z = \tfrac12 \log\frac{1+r}{1-r}, \qquad \zeta = \tfrac12 \log\frac{1+\rho}{1-\rho},$$

introduced by R. A. Fisher (Ref. 13, 90). Fisher has shown that the
variable z is, already for moderate values of n, approximately
normally distributed with mean and variance given by the approximate
expressions

(29.7.4)    $$E(z) = \zeta + \frac{\rho}{2(n-1)}, \qquad D^2(z) = \frac{1}{n-3}.$$

Thus the form of the z-distribution is, in the first approximation,
independent of the parameter $\rho$, while the distribution of r changes its
form considerably when $\rho$ varies. It is instructive to compare in this
respect the illustrations of the r- and z-distributions given in Figs
29 and 30. Cf further 31.3, Ex. 6.
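The use of the transformation (29.7.3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it reads (29.7.4) as E(z) = ζ + ρ/(2(n − 1)) and D²(z) = 1/(n − 3), it ignores the small bias term when forming an interval for ρ, and the function names and the normal point u = 1.96 are choices made here, not part of the text.

```python
import math

def fisher_z(r):
    """Transformation (29.7.3): z = (1/2) log((1 + r) / (1 - r))."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

def z_mean_and_variance(rho, n):
    """Approximate mean and variance of z, reading (29.7.4) as stated above."""
    zeta = fisher_z(rho)
    return zeta + rho / (2.0 * (n - 1)), 1.0 / (n - 3)

def approx_interval_for_rho(r, n, u=1.96):
    """Rough interval for rho: z +/- u / sqrt(n - 3), mapped back by tanh.

    Assumption of this sketch: the bias term rho / (2(n - 1)) is neglected,
    and u = 1.96 is taken as the two-sided 5 % point of the normal law.
    """
    z = fisher_z(r)
    half_width = u / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

# Hypothetical sample: r = 0.6 observed in n = 28 pairs.
low, high = approx_interval_for_rho(0.6, 28)
print(round(low, 3), round(high, 3))
```

Since the approximate variance 1/(n − 3) does not involve ρ, the half-width of the interval on the z scale is the same whatever value of r has been observed.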

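In the particular case $\rho = 0$, the fr. f. (29.7.1) reduces by (12.4.4) to

(29.7.5)    $$f_n(r) = \frac{1}{\sqrt{\pi}}\,\frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n-2}{2}\right)}\,(1-r^2)^{\frac{n-4}{2}},$$

a form conjectured by Student (Ref. 222) in 1908. We have already
encountered this fr. f. in other connections in (18.2.7) and (29.4.4).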

Fig. 30 a. Frequency curves for $z = \tfrac12 \log\frac{1+r}{1-r}$ in samples from a normal
population, n = 10.

Fig. 30 b. Frequency curves for $z = \tfrac12 \log\frac{1+r}{1-r}$ in samples from a normal
population, n = 50.

By 18.2, the transformed variable $t = \sqrt{n-2}\,\dfrac{r}{\sqrt{1-r^2}}$ is in this case
distributed in Student's distribution with n − 2 d. of fr. If $t_p$ denotes
the p % value of t for n − 2 d. of fr. (cf 18.2), we have the
probability p % of obtaining a value of t such that $|t| > t_p$, and this
inequality is equivalent with (cf 31.3, Ex. 7)

(29.7.6)    $$|r| > \frac{t_p}{\sqrt{t_p^2 + n - 2}}.$$
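As a small numerical sketch of the bound (29.7.6), the following Python lines convert a Student value $t_p$ into the corresponding limit for |r| and back again. The function names are introduced here for illustration only, and the figure 2.306 is assumed to be the 5 % value of t for 8 d. of fr.

```python
import math

def critical_r(t_p, n):
    """Bound in (29.7.6): |r| > t_p / sqrt(t_p^2 + n - 2)."""
    return t_p / math.sqrt(t_p * t_p + n - 2)

def student_t_from_r(r, n):
    """t = sqrt(n - 2) * r / sqrt(1 - r^2), Student with n - 2 d. of fr. when rho = 0."""
    return math.sqrt(n - 2) * r / math.sqrt(1.0 - r * r)

# Hypothetical illustration: n = 10 observations, t_p = 2.306 (assumed 5 % value, 8 d. of fr.).
bound = critical_r(2.306, 10)
print(round(bound, 3))
```

An observed |r| above this bound in a sample of 10 would then be counted as a deviation from ρ = 0 on the 5 % level.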



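29.8. The regression coefficients. The regression coefficients of
the parent distribution

$$\beta_{21} = \frac{\mu_{11}}{\mu_{20}} = \frac{\rho\,\sigma_2}{\sigma_1}, \qquad \beta_{12} = \frac{\mu_{11}}{\mu_{02}} = \frac{\rho\,\sigma_1}{\sigma_2},$$

have been defined in 21.6. In accordance with the general rules of
27.1, the corresponding regression coefficients of the sample will be
denoted by

(29.8.1)    $$b_{21} = \frac{m_{11}}{m_{20}} = \frac{r\,s_2}{s_1}, \qquad b_{12} = \frac{m_{11}}{m_{02}} = \frac{r\,s_1}{s_2}.$$

It will be sufficient to consider the sampling distribution of one of
these, say $b_{21}$. The distribution of $b_{12}$ can then be obtained by
permutation of indices.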

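In the joint distribution (29.6.4) of $m_{20}$, $m_{11}$ and $m_{02}$, we replace
$m_{11}$ by the new variable $b_{21}$ by means of the substitution $m_{11} = m_{20} b_{21}$.
We can then directly perform the integration, first with respect to
$m_{02}$ over all values such that $m_{02} > m_{20} b_{21}^2$, and then with respect to
$m_{20}$ over all positive values. In this way we obtain the following
expression for the fr. f. of the sample regression coefficient $b_{21}$:

(29.8.2)    $$\frac{\Gamma\left(\frac{n}{2}\right)}{\sqrt{\pi}\,\Gamma\left(\frac{n-1}{2}\right)} \cdot \frac{M^{\frac{n-1}{2}}}{\mu_{20}^{\frac{n-2}{2}}\,\left(\mu_{20} b_{21}^2 - 2\mu_{11} b_{21} + \mu_{02}\right)^{\frac{n}{2}}}.$$

This distribution was first found by K. Pearson and Romanovsky
(Ref. 185, 210). If we introduce here the new variable

(29.8.3)    $$t = \sqrt{\frac{n-1}{M}}\,(\mu_{20} b_{21} - \mu_{11}) = \frac{\sigma_1 \sqrt{n-1}}{\sigma_2 \sqrt{1-\rho^2}}\,(b_{21} - \beta_{21}),$$

where $M = \mu_{20}\mu_{02} - \mu_{11}^2$, it is found that t is distributed in Student's
distribution with n − 1 d. of fr.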

If we compare the distribution of $b_{21}$ with the distribution of r,
it is evident that the former has not the attractive property belonging
to the latter, of containing only the population parameter directly
corresponding to the variable. The fr. f. (29.8.2) contains, in fact, all
three moments $\mu_{20}$, $\mu_{11}$ and $\mu_{02}$, and if we want to calculate the
quantity t from (29.8.3) in order to test some hypothetical value of
$\beta_{21}$, we shall have to introduce hypothetical values of all these three
moments. In order to remove this inconvenience, we consider the variable

(29.8.4)    $$t = \frac{s_1 \sqrt{n-2}}{s_2 \sqrt{1-r^2}}\,(b_{21} - \beta_{21}),$$

where the population characteristics $\sigma_1$, $\sigma_2$ and $\rho$ occurring in (29.8.3)
have been replaced by the corresponding sample characteristics $s_1$, $s_2$
and r, while the factor $\sqrt{n-1}$ has been replaced by $\sqrt{n-2}$. If this
variable t is introduced instead of $m_{02}$ in the joint distribution (29.6.4),
the integration with respect to $m_{11}$ and $m_{20}$ can be directly performed,
and we obtain the interesting result that t is distributed in Student's
distribution with n − 2 d. of fr. (Bartlett, Ref. 54.) The replacing of
the population characteristics by sample characteristics has thus
resulted in a loss of one d. of fr. When it is required to test a
hypothetical value of $\beta_{21}$, we can now calculate t directly from an
actual sample, and thus obtain a test of significance for the deviation
of the observed value of $b_{21}$ from the hypothetical. (Cf 31.3, Ex. 6.)
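The calculation of the variable (29.8.4) from an actual sample may be sketched as follows in Python. The data and the function names are invented for the illustration, and the moments are taken with the divisor n, which is assumed here to be the convention of 27.1.

```python
import math

def sample_moments(xs, ys):
    """Return (mean_x, mean_y, m20, m11, m02), all with divisor n."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m20 = sum((x - mx) ** 2 for x in xs) / n
    m02 = sum((y - my) ** 2 for y in ys) / n
    m11 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return mx, my, m20, m11, m02

def regression_t(xs, ys, beta21):
    """Statistic (29.8.4): t = s1 sqrt(n - 2) (b21 - beta21) / (s2 sqrt(1 - r^2))."""
    n = len(xs)
    _, _, m20, m11, m02 = sample_moments(xs, ys)
    b21 = m11 / m20                       # sample regression coefficient
    r = m11 / math.sqrt(m20 * m02)        # sample correlation coefficient
    s1, s2 = math.sqrt(m20), math.sqrt(m02)
    t = s1 * math.sqrt(n - 2) * (b21 - beta21) / (s2 * math.sqrt(1.0 - r * r))
    return b21, t

# Made-up data; the hypothetical value beta21 = 1 is to be tested.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]
b21, t = regression_t(xs, ys, beta21=1.0)
print(b21, t)
```

The value of t would then be compared with the p % value of Student's distribution with n − 2 d. of fr. No population moment enters the computation, which is the point of replacing (29.8.3) by (29.8.4).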


29.9. Sampling from a k-dimensional normal distribution. The
results of 29.6 may be generalized to the case of a k-dimensional
normal parent distribution. Consider a non-singular normal
distribution in k dimensions (cf 24.2). Without loss of generality, we may
assume the first order moments equal to zero, so that the fr. f. is
(cf 24.2.1)

(29.9.1)    $$\frac{1}{(2\pi)^{k/2}\sqrt{\Lambda}}\,e^{-\frac{1}{2\Lambda}\sum_{i,j}\Lambda_{ij}x_i x_j} = \frac{1}{(2\pi)^{k/2}\,\sigma_1 \cdots \sigma_k \sqrt{P}}\,e^{-\frac{1}{2P}\sum_{i,j}P_{ij}\frac{x_i x_j}{\sigma_i \sigma_j}},$$

where $\mathbf{\Lambda} = \{\lambda_{ij}\}$ is the moment matrix, and $\mathbf{P} = \{\rho_{ij}\}$ the correlation
matrix of the distribution (cf 22.3). $\Lambda$ and $P$ are the corresponding
determinants, with cofactors $\Lambda_{ij}$ and $P_{ij}$. Throughout this paragraph,
the subscripts i and j will always have to run from 1 to k.

Suppose now that we have at our disposal a sample of n observed points
from this distribution. Let the $\nu$:th point of the sample be denoted
by $(x_{1\nu}, x_{2\nu}, \ldots, x_{k\nu})$, where $\nu = 1, 2, \ldots, n$, and suppose $n > k$. We
then calculate the moment characteristics of the first and second order
for the sample. According to the general rules of 27.1, and the
notations for the corresponding population moments introduced in 22.3,
these will be denoted by

(29.9.2)
$$\bar{x}_i = \frac{1}{n}\sum_{\nu=1}^{n} x_{i\nu},$$
$$l_{ii} = s_i^2 = \frac{1}{n}\sum_{\nu=1}^{n} (x_{i\nu} - \bar{x}_i)^2,$$
$$l_{ij} = r_{ij} s_i s_j = \frac{1}{n}\sum_{\nu=1}^{n} (x_{i\nu} - \bar{x}_i)(x_{j\nu} - \bar{x}_j).$$

There are k sample means $\bar{x}_i$, and k variances $l_{ii} = s_i^2$. Further, since
$l_{ji} = l_{ij}$, there are $\tfrac12 k(k-1)$ distinct covariances $l_{ij}$ with $i \neq j$. The
total number of distinct variables $l_{ij}$ is thus $\tfrac12 k(k+1)$.

The matrices $\mathbf{L} = \{l_{ij}\}$ and $\mathbf{R} = \{r_{ij}\}$ are the moment matrix and
the correlation matrix of the sample, while the corresponding
determinants are $L = |l_{ij}|$ and $R = |r_{ij}|$.

The joint distribution of all the variables $\bar{x}_i$ and $l_{ij}$ can now be
found in the same way as the corresponding distribution in 29.6. In
direct generalization of (29.6.2), we obtain for the joint c. f. of all
these variables the expression

(29.9.3)    $$\frac{1}{(2\pi)^{kn/2}\,\Lambda^{n/2}}\int e^{\Omega}\,dx_{11} \cdots dx_{kn},$$

$$\Omega = i\sum_{i} t_i \bar{x}_i + i\sum_{i,j}\varepsilon_{ij} t_{ij} l_{ij} - \frac{1}{2}\sum_{i,j}\frac{\Lambda_{ij}}{\Lambda}\sum_{\nu=1}^{n} x_{i\nu}x_{j\nu},$$

where the integral is extended over the kn-dimensional space of the
variables $x_{i\nu}$ $(i = 1, \ldots, k;\ \nu = 1, \ldots, n)$, while as in 29.5 we write
$\varepsilon_{ij} = 1$ for $i = j$, and $\varepsilon_{ij} = \tfrac12$ for $i \neq j$.

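For every i, we now replace the set of n variables $x_{i1}, \ldots, x_{in}$ by
n new variables $\xi_{i1}, \ldots, \xi_{in}$, by means of an orthogonal transformation
such that $\xi_{i1} = \sqrt{n}\,\bar{x}_i$, using the same transformation matrix for all
values of i. We then have for all i and j

$$\sum_{\nu=1}^{n} x_{i\nu}x_{j\nu} = \sum_{\nu=1}^{n} \xi_{i\nu}\xi_{j\nu},$$

$$n\,l_{ij} = \sum_{\nu=1}^{n} x_{i\nu}x_{j\nu} - n\,\bar{x}_i\bar{x}_j = \sum_{\nu=2}^{n} \xi_{i\nu}\xi_{j\nu},$$

and hence

(29.9.4)    $$\Omega = \frac{i}{\sqrt{n}}\sum_{i} t_i \xi_{i1} - \frac{1}{2}\sum_{i,j}\frac{\Lambda_{ij}}{\Lambda}\,\xi_{i1}\xi_{j1} - \frac{1}{n}\sum_{i,j}\left(\frac{n\Lambda_{ij}}{2\Lambda} - i\,\varepsilon_{ij}t_{ij}\right)\sum_{\nu=2}^{n}\xi_{i\nu}\xi_{j\nu}.$$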

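Introducing this expression of $\Omega$ in (29.9.3), the integral may be
evaluated in the same way as the corresponding integral in (29.6.2),
and the joint c. f. (29.9.3) assumes the form

(29.9.5)    $$e^{-\frac{1}{2n}\sum_{i,j}\lambda_{ij}t_i t_j}\left(\frac{A}{A^*}\right)^{\frac{n-1}{2}},$$

where $A$ and $A^*$ denote the determinants of the matrices

$$\mathbf{A} = \left\{\frac{n\Lambda_{ij}}{2\Lambda}\right\} \quad \text{and} \quad \mathbf{A}^* = \left\{\frac{n\Lambda_{ij}}{2\Lambda} - i\,\varepsilon_{ij}t_{ij}\right\}.$$

Thus in particular $A = (\tfrac12 n)^k \Lambda^{-1}$. In the same way as in 29.6, the
joint c. f. is a product of two factors, the first of which is the c. f.
of a normal distribution, while the second is of the form (29.5.7),
and thus corresponds to a distribution of the form (29.5.1), with
$\mathbf{A} = \{n\Lambda_{ij}/(2\Lambda)\}$ and the matrix of variables $\mathbf{X} = \mathbf{L} = \{l_{ij}\}$. Denoting by
S the set of all points in the $\tfrac12 k(k+1)$-dimensional space of the
variables $l_{ij}$ such that the symmetric matrix $\mathbf{L}$ is definite positive,
we thus obtain the following generalization of the theorem of 29.6:

The combined random variables $(\bar{x}_1, \ldots, \bar{x}_k)$ and $(l_{11}, l_{12}, \ldots, l_{kk})$ are
independent. The joint distribution of $\bar{x}_1, \ldots, \bar{x}_k$ is normal, with the
same first order moments as the parent distribution, and the moment
matrix $n^{-1}\mathbf{\Lambda}$. The joint distribution of the $\tfrac12 k(k+1)$ distinct variables
$l_{ij}$ has the fr. f. $f_n$ given by

(29.9.6)    $$f_n(l_{11}, l_{12}, \ldots, l_{kk}) = C_{kn}\,\frac{\left(\frac{n}{2}\right)^{\frac{k(n-1)}{2}}}{\Lambda^{\frac{n-1}{2}}}\,L^{\frac{n-k-2}{2}}\,e^{-\frac{n}{2\Lambda}\sum_{i,j}\Lambda_{ij}l_{ij}}$$

for every point in the set S, while $f_n = 0$ in the complementary set $S^*$.
The constant $C_{kn}$ is given by (29.5.2).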


This theorem was first proved by Wishart (Ref. 240) by an
extension of the geometrical methods due to R. A. Fisher, and then by
Wishart and Bartlett (Ref. 241) by the method of characteristic
functions. We also refer to a paper by Simonsen (Ref. 213 a).


29.10. The generalized variance. The determinant $L = |l_{ij}|$
represents the generalized variance of the sample (cf 22.7). Following
Wilks (Ref. 232), we shall now indicate how the moments of L may
be determined. For the explicit distribution of L, we refer to
Kullback (Ref. 143).

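The integral of the fr. f. $f_n$ in (29.9.6) over the set S is obviously
equal to 1. Now the set S is invariant under any transformation of
the form $w_{ij} = a\,l_{ij}$, where $a > 0$. Taking $a = n$, and writing $W = |w_{ij}|$,
we thus obtain

$$\int_S W^{\frac{n-k-2}{2}}\,e^{-\frac{1}{2\Lambda}\sum_{i,j}\Lambda_{ij}w_{ij}}\,dw_{11} \cdots dw_{kk} = \frac{(2^k \Lambda)^{\frac{n-1}{2}}}{C_{kn}}.$$

Since this relation holds for all values of $n > k$, we may replace n
by $n + 2\nu$ and then obtain, after reintroducing the variables $l_{ij}$,

$$\int_S L^{\frac{n-k-2}{2}+\nu}\,e^{-\frac{n}{2\Lambda}\sum_{i,j}\Lambda_{ij}l_{ij}}\,dl_{11} \cdots dl_{kk} = \frac{1}{C_{k,\,n+2\nu}}\left(\frac{2^k \Lambda}{n^k}\right)^{\frac{n-1}{2}+\nu}.$$

After multiplication with $C_{kn}\left(\dfrac{n^k}{2^k\Lambda}\right)^{\frac{n-1}{2}}$ this gives, taking account of
(29.9.6) and (29.5.2),

$$E(L^{\nu}) = \frac{C_{kn}}{C_{k,\,n+2\nu}}\left(\frac{2^k \Lambda}{n^k}\right)^{\nu} = \frac{2^{k\nu}\Lambda^{\nu}}{n^{k\nu}}\prod_{i=1}^{k}\frac{\Gamma\left(\frac{n-i}{2}+\nu\right)}{\Gamma\left(\frac{n-i}{2}\right)}$$

for $n + 2\nu > k$, i. e. for any $\nu > -\tfrac12(n-k)$. In particular we have

$$E(L) = \frac{(n-1)(n-2)\cdots(n-k)}{n^k}\,\Lambda,$$

$$D^2(L) = \frac{k(2n+1-k)}{(n-k)(n-k+1)} \cdot \frac{(n-1)^2 \cdots (n-k)^2}{n^{2k}}\,\Lambda^2.$$

For a one dimensional distribution (k = 1) we have $L = l_{11} = m_2$ and
$\Lambda = \sigma^2$, and the above expression for $E(L^{\nu})$ then reduces to the
formula (29.3.2).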
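The moment formula for the generalized variance lends itself to a direct numerical check. The Python sketch below evaluates $E(L^{\nu})$ from the Gamma-function product and compares it with the closed forms for E(L) and D²(L). It assumes the reading of these formulae given above; the function names and the numerical values of n, k and Λ are illustrative only.

```python
import math

def moment_of_L(nu, n, k, lam=1.0):
    """E(L^nu) = (2^k lam / n^k)^nu * prod_i Gamma((n-i)/2 + nu) / Gamma((n-i)/2)."""
    log_ratio = sum(math.lgamma((n - i) / 2 + nu) - math.lgamma((n - i) / 2)
                    for i in range(1, k + 1))
    return (2.0 ** k * lam / n ** k) ** nu * math.exp(log_ratio)

def mean_of_L(n, k, lam=1.0):
    """Closed form E(L) = (n-1)(n-2)...(n-k) lam / n^k."""
    return math.prod(n - i for i in range(1, k + 1)) * lam / n ** k

def variance_of_L(n, k, lam=1.0):
    """Closed form D^2(L) = k(2n+1-k)/((n-k)(n-k+1)) * ((n-1)...(n-k))^2 lam^2 / n^(2k)."""
    prod = math.prod(n - i for i in range(1, k + 1))
    return (k * (2 * n + 1 - k) / ((n - k) * (n - k + 1))
            * prod ** 2 * lam ** 2 / n ** (2 * k))

# Illustrative values: n = 12 observations in k = 3 dimensions, lam = 2.5.
n, k, lam = 12, 3, 2.5
print(moment_of_L(1, n, k, lam), mean_of_L(n, k, lam))
print(moment_of_L(2, n, k, lam) - moment_of_L(1, n, k, lam) ** 2, variance_of_L(n, k, lam))
```

Working with `math.lgamma` rather than `math.gamma` keeps the ratio of Gamma functions stable for larger n.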


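29.11. The generalized Student ratio. Consider now a sample
from a k-dimensional normal distribution with arbitrary mean values
$m_1, \ldots, m_k$, and denote by $l'_{ij}$ the product moments about the
population mean:

(29.11.1)    $$l'_{ij} = \frac{1}{n}\sum_{\nu=1}^{n}(x_{i\nu} - m_i)(x_{j\nu} - m_j) = l_{ij} + (\bar{x}_i - m_i)(\bar{x}_j - m_j),$$

where the $\bar{x}_i$ and the $l_{ij}$ are given by (29.9.2).

There are $\tfrac12 k(k+1)$ distinct variables $l'_{ij}$. If we write $y_{i\nu} = x_{i\nu} - m_i$,
the joint c. f. of the $l'_{ij}$ becomes

$$\frac{1}{(2\pi)^{kn/2}\,\Lambda^{n/2}}\int e^{\Omega}\,dy_{11} \cdots dy_{kn},$$

where

$$\Omega = -\frac{1}{n}\sum_{i,j}\left(\frac{n\Lambda_{ij}}{2\Lambda} - i\,\varepsilon_{ij}t_{ij}\right)\sum_{\nu=1}^{n} y_{i\nu}y_{j\nu}.$$

Comparing this with (29.9.3)-(29.9.5) we find that the c. f. of the
$l'_{ij}$ is $(A/A^*)^{n/2}$, where $A$ and $A^*$ denote the same determinants as in
(29.9.5). It follows that the joint fr. f. of the $l'_{ij}$ is obtained if, in
(29.9.6), we replace n by n + 1, except in the two factors $\tfrac{n}{2}$ (the base
of the power in front) and $\tfrac{n}{2\Lambda}$ (in the exponent), which arise from the
matrix $\mathbf{A}$.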

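Writing $L' = |l'_{ij}|$, we then obtain by the same transformation as
in the preceding paragraph

$$E(L'^{\mu}) = \left(\frac{2^k \Lambda}{n^k}\right)^{\mu}\frac{C_{k,\,n+1}}{C_{k,\,n+1+2\mu}}$$

for any $\mu > -\tfrac12(n+1-k)$. On the other hand, according to (29.11.1)
$L'$ is a function of the random variables $l_{ij}$ and $\xi_i = \bar{x}_i - m_i$, and the
joint fr. f. of all these variables is by the theorem of 29.9

$$f(\xi, l) = \frac{n^{k/2}}{(2\pi)^{k/2}\sqrt{\Lambda}}\,e^{-\frac{n}{2\Lambda}\sum_{i,j}\Lambda_{ij}\xi_i\xi_j}\,f_n(l),$$

where $f_n(l) = f_n(l_{11}, l_{12}, \ldots, l_{kk})$ is given by (29.9.6). Thus we may
also write

$$E(L'^{\mu}) = \int L'^{\mu}\,f(\xi, l)\,d\xi\,dl,$$

where the integral is extended over the set S (defined in 29.9) with
respect to the $l_{ij}$, and over $(-\infty, \infty)$ with respect to every $\xi_i$. Here
we may now apply once more the transformation of the preceding
paragraph, writing $w_{ij} = n\,l_{ij}$ and $\eta_i = \sqrt{n}\,\xi_i$, and then replacing n
by $n + 2\nu$. Equating the two expressions of $E(L'^{\mu})$, we then obtain
for any $\nu > 0$ and $\mu > -\nu - \tfrac12(n+1-k)$

$$E(L'^{\mu} L^{\nu}) = \left(\frac{2^k \Lambda}{n^k}\right)^{\mu+\nu}\frac{C_{kn}\,C_{k,\,n+1+2\nu}}{C_{k,\,n+2\nu}\,C_{k,\,n+1+2\mu+2\nu}}.$$

Taking $\mu = -\nu$, this reduces to

$$E\left[\left(\frac{L}{L'}\right)^{\nu}\right] = \frac{C_{kn}\,C_{k,\,n+1+2\nu}}{C_{k,\,n+2\nu}\,C_{k,\,n+1}} = \frac{\Gamma\left(\frac{n}{2}\right)\Gamma\left(\frac{n-k}{2}+\nu\right)}{\Gamma\left(\frac{n-k}{2}\right)\Gamma\left(\frac{n}{2}+\nu\right)}.$$

Thus by (18.4.4) the variable $L/L'$ has the same moments as the
Beta-distribution with the fr. f.

(29.11.2)    $$\frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-k}{2}\right)\Gamma\left(\frac{k}{2}\right)}\,x^{\frac{n-k}{2}-1}(1-x)^{\frac{k}{2}-1}, \qquad (0 < x < 1).$$

Since a distribution with finite range is uniquely determined by its
moments (cf 15.4), it follows that $L/L'$ has the fr. f. (29.11.2). On
the other hand, we obtain from (29.11.1)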
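$$L' = L + \sum_{i,j} L_{ij}(\bar{x}_i - m_i)(\bar{x}_j - m_j),$$

$$\frac{L}{L'} = \frac{1}{1 + \sum_{i,j}\dfrac{L_{ij}}{L}(\bar{x}_i - m_i)(\bar{x}_j - m_j)},$$

where $L_{ij}$ is the cofactor of $l_{ij}$ in L. The quadratic form in the
denominator is non-negative, since $\mathbf{L}$ is the moment matrix of a
distribution, viz. the distribution of the sample. If we now introduce a
new variable T by writing

(29.11.3)    $$T^2 = (n-1)\sum_{i,j}\frac{L_{ij}}{L}(\bar{x}_i - m_i)(\bar{x}_j - m_j),$$

where $T \geq 0$, we have

$$\frac{L}{L'} = \frac{1}{1 + \dfrac{T^2}{n-1}},$$

and by a simple transformation of (29.11.2) the fr. f. of T is found
to be

(29.11.4)    $$\frac{2\,\Gamma\left(\frac{n}{2}\right)}{(n-1)^{\frac{k}{2}}\,\Gamma\left(\frac{k}{2}\right)\Gamma\left(\frac{n-k}{2}\right)} \cdot \frac{x^{k-1}}{\left(1 + \dfrac{x^2}{n-1}\right)^{\frac{n}{2}}}, \qquad (x > 0).$$

For k = 1, this reduces to the positive half of the ordinary Student
distribution (18.2.4) with n − 1 degrees of freedom. The distribution
of T has been found by Hotelling (Ref. 126), and the above proof is
due to Wilks (Ref. 232).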

Just as the ordinary Student ratio t may be used to test the
significance of the deviation of an observed mean $\bar{x}$ from some hypothetical
value m, the generalized Student ratio T provides a test of the joint
deviation of the sample means $\bar{x}_1, \ldots, \bar{x}_k$ from some hypothetical
system of values $m_1, \ldots, m_k$.
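A minimal sketch of how T in (29.11.3) might be computed from a sample is given below. It rests on the observation that $L_{ij}/L$ is the (i, j) element of the inverse of the symmetric matrix L, so that the quadratic form can be obtained by solving one linear system. The helper `solve`, the function names and the data used afterwards are illustrative assumptions, and the elimination routine is meant for small k only.

```python
import math

def solve(a, b):
    """Solve a y = b by Gaussian elimination with partial pivoting (small k only)."""
    k = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(m[i][c]))
        m[c], m[p] = m[p], m[c]
        for i in range(c + 1, k):
            f = m[i][c] / m[c][c]
            for j in range(c, k + 1):
                m[i][j] -= f * m[c][j]
    y = [0.0] * k
    for i in reversed(range(k)):
        y[i] = (m[i][k] - sum(m[i][j] * y[j] for j in range(i + 1, k))) / m[i][i]
    return y

def generalized_student_ratio(points, m0):
    """T from (29.11.3): T^2 = (n-1) * sum_ij (L_ij / L)(xbar_i - m_i)(xbar_j - m_j)."""
    n, k = len(points), len(m0)
    xbar = [sum(p[i] for p in points) / n for i in range(k)]
    # Sample moment matrix with divisor n, as in (29.9.2).
    L = [[sum((p[i] - xbar[i]) * (p[j] - xbar[j]) for p in points) / n
          for j in range(k)] for i in range(k)]
    d = [xbar[i] - m0[i] for i in range(k)]
    y = solve(L, d)                      # y = L^{-1} d, so d . y is the quadratic form
    return math.sqrt((n - 1) * sum(di * yi for di, yi in zip(d, y)))

# Made-up two-dimensional sample and hypothetical means (2, 2).
sample = [(1.0, 2.0), (2.0, 1.0), (4.0, 3.0), (5.0, 7.0), (3.0, 3.0)]
print(generalized_student_ratio(sample, (2.0, 2.0)))
```

For k = 1 the function returns the absolute value of the ordinary Student ratio, and T is unchanged when any coordinate is rescaled together with its hypothetical mean.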

In 29.4, we have shown how the Student ratio may be modified so as to provide
a test of the difference between two mean values. An analogous modification may
be applied to the generalized ratio T.

Suppose that we are given two samples of $n_1$ and $n_2$ individuals respectively,
drawn from the same k-dimensional normal population, and let $\bar{x}_{1i}$, $l_{1ij}$ and $\bar{x}_{2i}$, $l_{2ij}$
denote the means, variances and covariances of the two samples. Let further $\mathbf{H}$
denote the matrix

$$\mathbf{H} = \{n_1 l_{1ij} + n_2 l_{2ij}\},$$

while $H$ and $H_{ij}$ are the corresponding determinant and its cofactors. Writing

(29.11.5)    $$U^2 = \frac{n_1 n_2 (n_1 + n_2 - 2)}{n_1 + n_2}\sum_{i,j}\frac{H_{ij}}{H}(\bar{x}_{1i} - \bar{x}_{2i})(\bar{x}_{1j} - \bar{x}_{2j}),$$

where $U \geq 0$, it can be shown by the same methods as above that U has the fr. f.
(29.11.4) with n replaced by $n_1 + n_2 - 1$. The expression (29.11.5) is entirely free
from the parameters of the parent distribution, so that U can be directly calculated
from a sample and used as a test of the joint divergence between the two systems
$\bar{x}_{1i}$ and $\bar{x}_{2i}$ of sample means. For k = 1, it will be seen that $U^2$ is identical with
$u^2$ as defined by (29.4.2).


29.12. Regression coefficients. For a two-dimensional
distribution we have seen that the variable (29.8.4), which is simply connected
with a sample regression coefficient, has the t-distribution with n − 2
d. of fr. This result has been generalized by Bartlett (Ref. 54) to
distributions in any number of dimensions.

Replacing in (23.2.3) and (23.4.5) the population characteristics
by sample characteristics, we obtain for the regression coefficient
$b_{12\cdot 34\ldots k}$ the expressions

$$b_{12\cdot 34\ldots k} = -\frac{s_1}{s_2}\,\frac{R_{12}}{R_{11}} = r_{12\cdot 34\ldots k}\,\frac{s_{1\cdot 34\ldots k}}{s_{2\cdot 34\ldots k}},$$

where the residual variances $s^2$ may be calculated from the sample
correlation coefficients $r_{ij}$ as shown by the first relation (23.4.6).

If $\beta_{12\cdot 34\ldots k}$ denotes the population value of the regression
coefficient, the variable

(29.12.1)    $$t = \sqrt{n-k}\;\frac{s_{2\cdot 34\ldots k}}{s_{1\cdot 23\ldots k}}\,\left(b_{12\cdot 34\ldots k} - \beta_{12\cdot 34\ldots k}\right)$$

has Student's distribution with n − k d. of fr. In the same way as
in the case of (29.8.4), we can thus obtain a test of significance for
the deviation of the observed value b of a regression coefficient from
any hypothetical value $\beta$. (Cf 31.3, Ex. 7.)

29.13. Partial and multiple correlation coefficients. We now
proceed to some further applications of the distribution (29.9.6),
restricting ourselves to the particular case when the k variables in the
normal parent distribution are independent. In this case $\lambda_{ij}$, $\rho_{ij}$ and $\Lambda_{ij}$
all reduce to zero for $i \neq j$, so that the moment matrix $\mathbf{\Lambda}$ is a
diagonal matrix, while the correlation matrix $\mathbf{P}$ is the unit matrix (cf 22.3).

In the joint distribution (29.9.6) of the $l_{ij}$ we replace the $l_{ij}$ with
$i \neq j$ by the sample correlation coefficients $r_{ij}$ by means of the
substitution $l_{ij} = r_{ij}\sqrt{l_{ii}l_{jj}}$. We then have $L = l_{11}l_{22} \cdots l_{kk}\,R$, where
$R = |r_{ij}|$ is the determinant of the correlation matrix $\mathbf{R}$ of the sample.
The Jacobian of the transformation (cf the analogous transformation
29.5.3) is $(l_{11} \cdots l_{kk})^{(k-1)/2}$, and the joint fr. f. of the variables $l_{ii}$ and
$r_{ij}$ becomes by (22.2.3), in the particular case considered here,

$$C_{kn}\,\frac{n^{\frac{k(n-1)}{2}}}{\left(2^k \lambda_{11}\lambda_{22} \cdots \lambda_{kk}\right)^{\frac{n-1}{2}}}\,R^{\frac{n-k-2}{2}}\,(l_{11}l_{22} \cdots l_{kk})^{\frac{n-3}{2}}\,e^{-\frac{n}{2}\sum_{i}\frac{l_{ii}}{\lambda_{ii}}}$$

for $l_{ii} > 0$ and all values of the $r_{ij}$ such that the matrix $\mathbf{R}$ is definite
positive. For all other values of the variables, the fr. f. is zero.

We can now directly integrate over $(0, \infty)$ with respect to every $l_{ii}$.
After introduction of the value (29.5.2) of $C_{kn}$, we obtain the joint
fr. f. of the sample correlation coefficients $r_{ij}$:

(29.13.1)    $$\frac{\left[\Gamma\left(\frac{n-1}{2}\right)\right]^{k-1}}{\pi^{\frac{k(k-1)}{4}}\,\Gamma\left(\frac{n-2}{2}\right) \cdots \Gamma\left(\frac{n-k}{2}\right)}\,R^{\frac{n-k-2}{2}}.$$

According to the terminology of Frisch (Ref. 113), the determinant
R is the square of the scatter coefficient of the sample (cf 22.7). The
moments of R may be determined by the method of 29.10. Denoting
by $B_{kn}$ the factor of $R^{\frac{n-k-2}{2}}$ in (29.13.1), we find, e. g.,

(29.13.2)
$$E(R) = \frac{B_{kn}}{B_{k,\,n+2}} = \frac{(n-2)(n-3) \cdots (n-k)}{(n-1)^{k-1}},$$
$$D^2(R) = \frac{k(k-1)}{n^2} + O\left(\frac{1}{n^3}\right).$$

The partial correlation coefficient between the sample values of the
variables $x_1$ and $x_2$, after elimination of the remaining variables
$x_3, \ldots, x_k$, is by (23.4.2)

(29.13.3)    $$r_{12\cdot 34\ldots k} = -\frac{R_{12}}{\sqrt{R_{11}R_{22}}},$$

where the $R_{ij}$ are the usual cofactors of R. In the particular case of
an uncorrelated parent distribution considered here, the corresponding
population value $\rho_{12\cdot 34\ldots k}$ is, of course, equal to zero.

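In order to find the distribution of $r_{12\cdot 34\ldots k}$ we regard (29.13.3)
as a substitution replacing $r_{12}$ by a new variable $r_{12\cdot 34\ldots k}$, while all
the $r_{ij}$ except $r_{12}$ are retained as variables. $R_{11}$ and $R_{22}$ do not
involve $r_{12}$, and thus (29.13.3) can be written, using notations
analogous to those of 11.5,

$$r_{12\cdot 34\ldots k} = \frac{R_{11\cdot 22}\,r_{12} + Q}{\sqrt{R_{11}R_{22}}},$$

where Q does not involve $r_{12}$. This shows that there is a one-to-one
correspondence between the two sets of variables. The Jacobian of
the transformation is

$$\frac{\partial r_{12}}{\partial r_{12\cdot 34\ldots k}} = \frac{\sqrt{R_{11}R_{22}}}{R_{11\cdot 22}}.$$

From (11.7.3) and (29.13.3) we further obtain

$$R = \frac{R_{11}R_{22}}{R_{11\cdot 22}}\left(1 - r_{12\cdot 34\ldots k}^2\right).$$

Introducing the substitution (29.13.3) in (29.13.1), we thus find that
the joint fr. f. of $r_{12\cdot 34\ldots k}$ and all $r_{ij}$ other than $r_{12}$ is

$$C\,\frac{(R_{11}R_{22})^{\frac{n-k-1}{2}}}{R_{11\cdot 22}^{\frac{n-k}{2}}}\,\left(1 - r_{12\cdot 34\ldots k}^2\right)^{\frac{n-k-2}{2}},$$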

where C is a constant. This is the product of two factors, one of
which depends only on $r_{12\cdot 34\ldots k}$, while the other depends only on the
$r_{ij}$ other than $r_{12}$. Since the variable $r_{12\cdot 34\ldots k}$ obviously ranges over the
whole interval (−1, 1), the multiplicative constant in its fr. f. is easily
determined, and we have by (22.1.2) the following theorem:

The partial correlation coefficient $r_{12\cdot 34\ldots k}$ is independent of all the
$r_{ij}$ other than $r_{12}$, and has the fr. f.

(29.13.4)    $$\frac{1}{\sqrt{\pi}}\,\frac{\Gamma\left(\frac{n-k+1}{2}\right)}{\Gamma\left(\frac{n-k}{2}\right)}\,(1-x^2)^{\frac{n-k-2}{2}}, \qquad (-1 < x < 1).$$

We observe that by (29.7.5) the total correlation coefficient $r_{12}$
has in the present case the fr. f.

$$\frac{1}{\sqrt{\pi}}\,\frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n-2}{2}\right)}\,(1-x^2)^{\frac{n-4}{2}}.$$

In order to pass from the distribution of $r_{12}$ to the distribution of
$r_{12\cdot 34\ldots k}$, we thus only have to replace n by n − (k − 2), i. e. to
subtract from n the number of variables eliminated. R. A. Fisher (Ref.
93) has shown that this property subsists even in the general case
when the variables in the parent distribution are not independent.

In the case of independence, it follows (cf 29.7) that the variable

$$t = \sqrt{n-k}\,\frac{r}{\sqrt{1-r^2}}, \qquad \text{where } r = r_{12\cdot 34\ldots k},$$

has Student's distribution with n − k d. of fr. Consequently the inequality

(29.13.5)    $$|r| > \frac{t_p}{\sqrt{t_p^2 + n - k}},$$

where $t_p$ is the p % value of t for n − k d. of fr., has the probability
p %. (Cf 31.3, Ex. 7.)
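For k = 3 the cofactor expression (29.13.3) and the bound (29.13.5) can be illustrated by the short Python sketch below. It is restricted to three variables so that the cofactors can be written out by hand; the function names and the numerical correlations are assumptions made for the illustration.

```python
import math

def cofactor3(a, i, j):
    """Cofactor of the (i, j) element of a 3 x 3 matrix (0-based indices)."""
    rows = [r for r in range(3) if r != i]
    cols = [c for c in range(3) if c != j]
    minor = (a[rows[0]][cols[0]] * a[rows[1]][cols[1]]
             - a[rows[0]][cols[1]] * a[rows[1]][cols[0]])
    return (-1) ** (i + j) * minor

def partial_r12_3(r12, r13, r23):
    """(29.13.3) for k = 3: r_12.3 = -R_12 / sqrt(R_11 R_22)."""
    R = [[1.0, r12, r13], [r12, 1.0, r23], [r13, r23, 1.0]]
    return -cofactor3(R, 0, 1) / math.sqrt(cofactor3(R, 0, 0) * cofactor3(R, 1, 1))

def critical_partial_r(t_p, n, k):
    """Bound in (29.13.5): |r| > t_p / sqrt(t_p^2 + n - k)."""
    return t_p / math.sqrt(t_p * t_p + n - k)

# Hypothetical sample correlations.
print(partial_r12_3(0.6, 0.5, 0.4))
```

The only change from the test for a total correlation coefficient is that n − 2 is replaced by n − k, in accordance with the rule of subtracting the number of variables eliminated.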

The multiple correlation coefficient $r_{1(2\ldots k)}$ between the sample
values of $x_1$ and $(x_2, \ldots, x_k)$ is, by (23.5.2), the non-negative square root

(29.13.6)    $$r_{1(2\ldots k)} = \sqrt{1 - \frac{R}{R_{11}}}.$$

The corresponding population value $\rho_{1(2\ldots k)}$ is, in the present case
of an uncorrelated normal parent distribution, equal to zero. We
now propose to find the distribution of $r_{1(2\ldots k)}$.
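A possible way of evaluating (29.13.6) numerically is sketched below in Python. The determinant routine is a plain Gaussian elimination written for this illustration; it and the function names are assumptions, not part of the text, and it is intended for small correlation matrices only.

```python
import math

def det(a):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in a]
    k = len(m)
    d = 1.0
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(m[i][c]))
        if m[p][c] == 0.0:
            return 0.0
        if p != c:
            m[c], m[p] = m[p], m[c]
            d = -d
        d *= m[c][c]
        for i in range(c + 1, k):
            f = m[i][c] / m[c][c]
            for j in range(c, k):
                m[i][j] -= f * m[c][j]
    return d

def multiple_correlation(R):
    """(29.13.6): r_1(2...k) = sqrt(1 - R / R_11) for a sample correlation matrix R."""
    R11 = det([row[1:] for row in R[1:]])   # cofactor of the (1, 1) element
    return math.sqrt(max(0.0, 1.0 - det(R) / R11))

# Hypothetical correlation matrix for k = 3.
R = [[1.0, 0.6, 0.5],
     [0.6, 1.0, 0.4],
     [0.5, 0.4, 1.0]]
print(multiple_correlation(R))
```

For k = 2 the function returns |r₁₂|, as it should, since the multiple correlation coefficient is defined as a non-negative root.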

In the joint distribution (29.13.1) of the $r_{ij}$, we replace the k − 1
variables $r_{12}, \ldots, r_{1k}$ by the k new variables $r = r_{1(2\ldots k)}$ and
$z_2, \ldots, z_k$, by means of the relations (29.13.6) and

$$r_{1i} = z_i\,r, \qquad (i = 2, 3, \ldots, k).$$

Between the new variables, we then have by (11.5.3) the relation

$$\sum_{i,j=2}^{k} R_{11\cdot ij}\,z_i z_j = R_{11},$$

by which one of the $z_i$, say $z_2$, may be expressed as a function of
the other $z_i$ and the $r_{ij}$ with $i > 1$ and $j > 1$. The Jacobian of this
transformation is

$$\frac{\partial(r_{12}, r_{13}, \ldots, r_{1k})}{\partial(r, z_3, \ldots, z_k)} = \begin{vmatrix} z_2 & r\dfrac{\partial z_2}{\partial z_3} & \cdots & r\dfrac{\partial z_2}{\partial z_k} \\ z_3 & r & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ z_k & 0 & \cdots & r \end{vmatrix} = r^{k-2}\,Q',$$

where $Q'$ does not involve r. Further, we obtain from (29.13.6)
$R = R_{11}(1 - r^2)$, and thus the introduction of the above substitution
in (29.13.1) yields an expression of the form

$$r^{k-2}\,(1-r^2)^{\frac{n-k-2}{2}}\,Q''$$

for the joint fr. f. of the new variables, where $Q''$ does not involve r.

Thus the multiple correlation coefficient $r_{1(2\ldots k)}$ is independent of
all the $r_{ij}$ with $i > 1$ and $j > 1$, and has the fr. f.

(29.13.7)    $$\frac{2\,\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{k-1}{2}\right)\Gamma\left(\frac{n-k}{2}\right)}\,x^{k-2}\,(1-x^2)^{\frac{n-k-2}{2}}, \qquad (0 < x < 1).$$

The square $r^2$ has the Beta-distribution with the fr. f.

(29.13.8)    $$\frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{k-1}{2}\right)\Gamma\left(\frac{n-k}{2}\right)}\,x^{\frac{k-3}{2}}\,(1-x)^{\frac{n-k-2}{2}}, \qquad (0 < x < 1).$$

The distribution of r was found by R. A. Fisher (Ref. 94), who
also (Ref. 98) solved the more general problem of finding this
distribution in the case of an arbitrary normal parent distribution. In this
general case, the fr. f. of r may be expressed as the product of the
function (29.13.7) with a power series containing the population value
$\rho_{1(2\ldots k)}$, in a similar way as in the case of the ordinary correlation
coefficient (cf 29.7.1).

Let us finally consider the behaviour of the distribution of $r^2$ for
large values of n. The variable $n r^2$ has the fr. f.

$$\frac{\Gamma\left(\frac{n-1}{2}\right)}{n\,\Gamma\left(\frac{k-1}{2}\right)\Gamma\left(\frac{n-k}{2}\right)}\left(\frac{x}{n}\right)^{\frac{k-3}{2}}\left(1 - \frac{x}{n}\right)^{\frac{n-k-2}{2}}.$$

When $n \to \infty$, this tends to the limit

(29.13.9)    $$\frac{1}{2^{\frac{k-1}{2}}\,\Gamma\left(\frac{k-1}{2}\right)}\,x^{\frac{k-3}{2}}\,e^{-\frac{x}{2}},$$

which is the fr. f. of a $\chi^2$ distribution with k − 1 d. of fr. (cf 31.3, Ex. 7).
Thus the distribution of $r^2$ does not tend to normality as $n \to \infty$.
Accordingly, we obtain from (29.13.8)

$$E(r^2) = \frac{k-1}{n-1}, \qquad D^2(r^2) = \frac{2(k-1)(n-k)}{(n-1)^2(n+1)} = O\left(\frac{1}{n^2}\right),$$

so that we have here an instance of the exceptional case mentioned
at the end of 28.4, where the variance is of a smaller order than $n^{-1}$,
and the theorem on the convergence to normality breaks down. This
takes, however, only place in the case considered here, when the
population value $\rho$ is equal to zero. When $\rho \neq 0$, the variance of $r^2$ is
of order $n^{-1}$, and the distribution approaches normality as $n \to \infty$.





Chapters 30-31. Tests of Significance, I. 


CHAPTER 30. 

Tests of Goodness of Fit and Allied Tests. 

30.1. The test in the case of a completely specified hypothetical
distribution. We now proceed to study the problem of testing the
agreement between probability theory and actual observations. In the
present paragraph, we shall consider the situation indicated in 26.2,
when a sample of n observed values of some variable (in any number
of dimensions) is given, and we want to know if this variable can be
reasonably regarded as a random variable having a given probability
distribution.

Let us denote as hypothesis H the hypothesis that our data form
a sample of n values of a random variable with the given pr. f. P(S).
We assume here that P(S) is completely specified, so that no unknown
parameter appears in its expression, and the probability P(S) may be
numerically calculated for any given set S. It is then required to
work out a method for testing whether our data may be regarded as
consistent with the hypothesis H.

If the hypothesis H is true, the distribution of the sample (cf 25.3),
which is the simple discrete distribution obtained by placing the mass
1/n in each of the n observed points, may be regarded as a statistical
image (cf 25.5) of the parent distribution specified by P(S). Owing
to random fluctuations, the two distributions will as a rule not
coincide, but for large values of n the distribution of the sample may be
expected to form an approximation to the parent distribution. As
already indicated in 26.2, it then seems natural to introduce some
measure of the deviation between the two distributions, and to base
our test on the properties of the sampling distribution of this measure.

Such deviation measures may be constructed in various ways, the
most generally used being that connected with the important $\chi^2$ test
introduced by K. Pearson (Ref. 183). Suppose that the space of the
variable is divided into a finite number r of parts $S_1, \ldots, S_r$ without
common points, and let the corresponding values of the given pr. f.
be $p_1, \ldots, p_r$, so that $p_i = P(S_i)$ and $\sum_{1}^{r} p_i = 1$. We assume that all
the $p_i$ are > 0. The r parts $S_i$ may, e. g., be the r groups into which
our sample values have been arranged for tabulation purposes. Let
the corresponding group frequencies in the sample be $\nu_1, \ldots, \nu_r$, so
that $\nu_i$ sample values belong to the set $S_i$, and we have $\sum_{1}^{r} \nu_i = n$.

Our first object is now to find a convenient measure of the
deviation of the distribution of the sample from the hypothetical
distribution. Any set $S_i$ carries the mass $\nu_i/n$ in the former distribution,
and the mass $p_i$ in the latter. It will then be in conformity with
the general principle of least squares (cf 15.6) to adopt as measure
of deviation an expression of the form $\sum_{1}^{r} c_i(\nu_i/n - p_i)^2$, where the
coefficients $c_i$ may be chosen more or less arbitrarily. It was shown
by K. Pearson that if we take $c_i = n/p_i$, we shall obtain a deviation
measure with particularly simple properties. We obtain in this way
the expression

$$\chi^2 = \sum_{i=1}^{r}\frac{(\nu_i - n p_i)^2}{n p_i} = \sum_{i=1}^{r}\frac{\nu_i^2}{n p_i} - n.$$

Thus $\chi^2$ is simply expressed in terms of the observed frequencies $\nu_i$
and the expected frequencies $n p_i$ for all r groups.
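In code, the deviation measure is a one-line sum over the groups. The Python sketch below computes it in both of the equivalent forms, the second being the short form that follows on expanding the square and using the relations Σ νᵢ = n and Σ pᵢ = 1. The function names and the die-throwing figures are made up for the illustration.

```python
def chi_square(observed, probs):
    """chi^2 = sum over groups of (nu_i - n p_i)^2 / (n p_i)."""
    n = sum(observed)
    return sum((v - n * p) ** 2 / (n * p) for v, p in zip(observed, probs))

def chi_square_short_form(observed, probs):
    """Equivalent form: sum of nu_i^2 / (n p_i), minus n."""
    n = sum(observed)
    return sum(v * v / (n * p) for v, p in zip(observed, probs)) - n

# Made-up data: 60 throws of a die, hypothesis p_i = 1/6 for every face.
observed = [8, 12, 9, 11, 6, 14]
probs = [1.0 / 6.0] * 6
print(chi_square(observed, probs), chi_square_short_form(observed, probs))
```

The resulting value would be referred to the $\chi^2$ distribution with r − 1 = 5 d. of fr.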

We shall now investigate the sampling distribution of $\chi^2$, assuming
throughout that the hypothesis H is true. It will be shown that we have

(30.1.1)    $$E(\chi^2) = r - 1, \qquad D^2(\chi^2) = 2(r-1) + \frac{1}{n}\left(\sum_{i=1}^{r}\frac{1}{p_i} - r^2 - 2r + 2\right).$$

We shall further prove the following theorem due to K. Pearson
(Ref. 183) which shows that, as the size of the sample increases, the
sampling distribution of $\chi^2$ tends to a limiting distribution completely
independent of the hypothetical pr. f. P(S).

As $n \to \infty$, the sampling distribution of $\chi^2$ tends to the distribution
defined by the fr. f.

(30.1.2)    $$k_{r-1}(x) = \frac{1}{2^{\frac{r-1}{2}}\,\Gamma\left(\frac{r-1}{2}\right)}\,x^{\frac{r-3}{2}}\,e^{-\frac{x}{2}}, \qquad (x > 0)$$

studied in 18.1. Using the terminology introduced in 18.1 and 29.2,
we may thus say that, in the limit, $\chi^2$ is distributed in a $\chi^2$ distribution
with r − 1 degrees of freedom.

At each of the n observations leading to the n observed points in
our sample, we have the probability $p_i$ to obtain a result belonging
to the set $S_i$. For any set of non-negative integers $\nu_1, \ldots, \nu_r$ such
that $\sum_{1}^{r} \nu_i = n$, the probability that, in the course of n observations,
we shall exactly $\nu_i$ times obtain a result belonging to $S_i$, for
$i = 1, \ldots, r$, is then (cf Ex. 9, p. 318)

$$\frac{n!}{\nu_1! \cdots \nu_r!}\,p_1^{\nu_1} \cdots p_r^{\nu_r},$$

which is the general term of the expansion of $(p_1 + \cdots + p_r)^n$. Thus
the joint distribution of the r group frequencies $\nu_1, \ldots, \nu_r$ is a simple
generalization of the binomial distribution, which is known as the
multinomial distribution. The joint c. f. of the variables $\nu_1, \ldots, \nu_r$ is

$$\left(p_1 e^{i t_1} + \cdots + p_r e^{i t_r}\right)^n,$$

as may be directly shown by a straightforward generalization of the
proof of the corresponding expression (16.2.3) in the binomial case.
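Writing

(30.1.3)    $$x_i = \frac{\nu_i - n p_i}{\sqrt{n p_i}}, \qquad (i = 1, 2, \ldots, r),$$

it is seen that the $x_i$ satisfy the identity $\sum_{1}^{r} x_i\sqrt{p_i} = 0$, and that we
have

$$\chi^2 = \sum_{i=1}^{r} x_i^2.$$

Further, the joint c. f. of the variables $x_1, \ldots, x_r$ is

$$\varphi(t_1, \ldots, t_r) = e^{-i\sqrt{n}\sum_{1}^{r} t_i\sqrt{p_i}}\left(p_1 e^{\frac{i t_1}{\sqrt{n p_1}}} + \cdots + p_r e^{\frac{i t_r}{\sqrt{n p_r}}}\right)^n.$$

From the MacLaurin expansion of this function, we deduce by some
easy calculation the expressions (30.1.1). We further find for any fixed
$t_1, \ldots, t_r$

$$\log \varphi(t_1, \ldots, t_r) = n \log\left[1 + \frac{i}{\sqrt{n}}\sum_{1}^{r} t_i\sqrt{p_i} - \frac{1}{2n}\sum_{1}^{r} t_i^2 + O\left(n^{-3/2}\right)\right] - i\sqrt{n}\sum_{1}^{r} t_i\sqrt{p_i}$$

$$= -\frac{1}{2}\left[\sum_{1}^{r} t_i^2 - \left(\sum_{1}^{r} t_i\sqrt{p_i}\right)^2\right] + O\left(n^{-1/2}\right),$$

so that the c. f. tends to the limit

$$\lim_{n \to \infty} \varphi(t_1, \ldots, t_r) = e^{-\frac{1}{2}\left[\sum_{1}^{r} t_i^2 - \left(\sum_{1}^{r} t_i\sqrt{p_i}\right)^2\right]} = e^{-\frac{1}{2}Q(t_1, \ldots, t_r)}.$$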

The quadratic form $Q(t_1, \ldots, t_r) = \sum_{1}^{r} t_i^2 - \left(\sum_{1}^{r} t_i\sqrt{p_i}\right)^2$ has the matrix
$\mathbf{\Lambda} = \mathbf{I} - \mathbf{p}\mathbf{p}'$, where $\mathbf{I}$ denotes the unit matrix (cf 11.1), while $\mathbf{p}$
denotes the column vector (cf 11.2) $\mathbf{p} = (\sqrt{p_1}, \ldots, \sqrt{p_r})$. Replacing
$t_1, \ldots, t_r$ by new variables $u_1, \ldots, u_r$ by means of an orthogonal
transformation such that $u_r = \sum_{1}^{r} t_i\sqrt{p_i}$, we obtain (cf 11.11)

$$Q(t_1, \ldots, t_r) = \sum_{1}^{r} t_i^2 - \left(\sum_{1}^{r} t_i\sqrt{p_i}\right)^2 = \sum_{1}^{r} u_i^2 - u_r^2 = \sum_{1}^{r-1} u_i^2.$$

It follows that $Q(t_1, \ldots, t_r)$ is non-negative and of rank r − 1 (cf
11.6), and that the matrix $\mathbf{\Lambda}$ has r − 1 characteristic numbers (cf 11.9)
equal to 1, while the r:th characteristic number is zero.

As $n \to \infty$, the joint c. f. of the variables $x_1, \ldots, x_r$ thus tends to
the expression $e^{-\frac{1}{2}Q}$, which is the c. f. of a singular normal
distribution (cf 24.3) of rank r − 1, the total mass of which is situated in
the hyperplane $\sum_{1}^{r} x_i\sqrt{p_i} = 0$. By the continuity theorem 10.7 it then
follows that, in the limit, $x_1, \ldots, x_r$ are distributed in this singular
normal distribution, with zero means and the moment matrix $\mathbf{\Lambda}$. It
then follows from 24.5 that, in the limit, the variable $\chi^2 = \sum_{1}^{r} x_i^2$ is
distributed in a $\chi^2$ distribution with r − 1 d. of fr. Thus the theorem
is proved.
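For small n and r, the expressions (30.1.1) can be checked by exact enumeration of the multinomial distribution, which may help in seeing how far the exact variance departs from the limiting value 2(r − 1). The Python sketch below does this under the reading of (30.1.1) as E(χ²) = r − 1 and D²(χ²) = 2(r − 1) + (Σ 1/pᵢ − r² − 2r + 2)/n. The function names and the probabilities used are illustrative assumptions.

```python
import math
from itertools import product

def exact_chi_square_moments(n, probs):
    """Mean and variance of chi^2 by summing over every multinomial outcome."""
    r = len(probs)
    mean = second = 0.0
    for head in product(range(n + 1), repeat=r - 1):
        last = n - sum(head)
        if last < 0:
            continue                      # not a valid set of group frequencies
        counts = head + (last,)
        prob = float(math.factorial(n))
        for v, p in zip(counts, probs):
            prob = prob / math.factorial(v) * p ** v
        x2 = sum((v - n * p) ** 2 / (n * p) for v, p in zip(counts, probs))
        mean += prob * x2
        second += prob * x2 * x2
    return mean, second - mean * mean

def formula_moments(n, probs):
    """The expressions (30.1.1), as read above."""
    r = len(probs)
    return r - 1, 2 * (r - 1) + (sum(1.0 / p for p in probs) - r * r - 2 * r + 2) / n

# Illustrative case: n = 6 observations in r = 3 groups.
print(exact_chi_square_moments(6, [0.2, 0.3, 0.5]))
print(formula_moments(6, [0.2, 0.3, 0.5]))
```

The enumeration grows like $(n+1)^{r-1}$ and is therefore only practical for very small cases.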

By means of this theorem, we can now introduce a test of the
hypothesis H considered above. Let $\chi_p^2$ denote the p % value of $\chi^2$
for r − 1 d. of fr. (cf 18.1 and Table 3). Then by the above theorem
the probability $P = P(\chi^2 > \chi_p^2)$ will for large n be approximately equal
to p %. Suppose now that we have fixed p so small that we agree
to regard it as practically certain that an event of probability p %
will not occur in one single trial (cf 26.2). Suppose further that n is
so large that, for practical purposes, the probability P may be
identified with its limiting value p %. If the hypothesis H is true, it is then
practically excluded that, in one single sample, we should encounter a
value of $\chi^2$ exceeding $\chi_p^2$.

If, in an actual sample, we find a value $\chi^2 > \chi_p^2$, we shall
accordingly say that our sample shows a significant deviation from the
hypothesis H, and we shall reject this hypothesis, at least until further
data are available. The probability that this situation will occur in
a case when H is actually true, so that H will be falsely rejected, is
precisely the probability $P = P(\chi^2 > \chi_p^2)$, which is approximately equal
to p %. We shall then say that we are working on a p % level of
significance.

If, on the other hand, we find a value $\chi^2 \leq \chi_p^2$, this will be
regarded as consistent with the hypothesis H. Obviously one isolated
result of this kind cannot be considered as sufficient evidence of the
truth of the hypothesis. In order to produce such evidence, we shall
have to apply the test repeatedly to new data of a similar character.
Whenever possible, other tests should also be applied.

When the test is applied in practice, and all the expected
frequencies $n p_i$ are $\geq 10$, the limiting $\chi^2$-distribution tabulated in Table
3 gives as a rule the value $\chi_p^2$ corresponding to a given P = p/100
with an approximation sufficient for ordinary purposes. If some of
the $n p_i$ are < 10, it is usually advisable to pool the smaller groups,
so that every group contains at least 10 expected observations,
before the test is applied. When the observations are so few that this
cannot be done, the $\chi^2$ tables should not be used, but some
information may still be drawn from the values of $E(\chi^2)$ and $D(\chi^2)$ calculated
according to (30.1.1).

Table 3 is only applicable when the number of d. of fr. is
$\leq 30$. For more than 30 d. of fr., it is usually sufficient to use
Fisher's proposition (cf 20.2) that $\sqrt{2\chi^2}$ for n d. of fr. is
approximately normally distributed, with the mean $\sqrt{2n-1}$ and unit s. d.
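Fisher's proposition gives a simple recipe for approximate percentage values beyond the range of a table: if u is the normal deviate exceeded with probability p %, then $\chi_p^2$ is roughly $\tfrac12\left(u + \sqrt{2n-1}\right)^2$. The Python sketch below implements this reading with the standard library; the function names are introduced here, and the result is an approximation only.

```python
import math
from statistics import NormalDist

def approx_chi_square_value(p_percent, dof):
    """Approximate p % value of chi^2: (1/2) (u + sqrt(2 dof - 1))^2."""
    u = NormalDist().inv_cdf(1.0 - p_percent / 100.0)
    return 0.5 * (u + math.sqrt(2.0 * dof - 1.0)) ** 2

def approx_upper_tail(x, dof):
    """Approximate P(chi^2 > x), treating sqrt(2 chi^2) - sqrt(2 dof - 1) as normal (0, 1)."""
    return 1.0 - NormalDist().cdf(math.sqrt(2.0 * x) - math.sqrt(2.0 * dof - 1.0))

# Illustration: approximate 5 % value for 40 d. of fr.
print(round(approx_chi_square_value(5, 40), 2))
```

For 40 d. of fr. the approximation falls a few tenths below the exact 5 % value, which is close enough for most purposes at that size.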

30.2. Examples. In practical applications of various tests of
significance, the 5 %, 1 % and 0.1 % levels of significance are often
used. Which level we should adopt in a given case will, of course,
depend on the particular circumstances of the case. In the numerical
examples that will be given in this book, we shall denote a value
exceeding the 5 % limit but not the 1 % limit as almost significant, a
value between the 1 % and 0.1 % limits as significant, and a value
exceeding the 0.1 % limit as highly significant. This terminology is,
of course, purely conventional.

Ex. 1. In a sequence of $n$ independent trials, the event $E$ has
occurred $\nu$ times. Are these data consistent with the hypothesis that $E$
has in every trial the given probability $p = 1 - q$?

The data may be regarded as a sample of $n$ values of a variable
which is equal to 1 or 0 according as $E$ occurs or not. The hypothesis
$H$ consists in the assertion that the two alternatives have fixed
probabilities $p$ and $q$. Thus we have two groups with the observed
frequencies $\nu$ and $n - \nu$, and the corresponding expected frequencies
$np$ and $nq$. Hence we obtain

$$\chi^2 = \frac{(\nu - np)^2}{np} + \frac{(n - \nu - nq)^2}{nq} = \frac{(\nu - np)^2}{npq}. \tag{30.2.1}$$

By the theorem of the preceding paragraph, this quantity is for
large $n$ approximately distributed in a $\chi^2$ distribution with one d.
of fr. This agrees with the fact (cf 16.4 and 18.1) that the standardized
variable $\frac{\nu - np}{\sqrt{npq}}$ is asymptotically normal (0, 1), so that its square has,
in the limit, the fr. f. $k_1(x)$. Accordingly, the percentage values of $\chi^2$
for one d. of fr. given in Table 3 are the squares of the corresponding
values for the normal distribution given in Table 2.

In $n = 4040$ throws with a coin, Buffon obtained $\nu = 2048$ heads
and $n - \nu = 1992$ tails. Is this consistent with the hypothesis that
there is a constant probability $p = \frac{1}{2}$ of throwing heads? We
have here $\chi^2 = \frac{(\nu - np)^2}{npq} = 0.776$, and this falls well below the 5 %
value of $\chi^2$ for one d. of fr., which by Table 3 is 3.841, so that the
data must be regarded as consistent with the hypothesis. The
corresponding value of $P(\chi^2 \geq 0.776)$ is about 0.38, which means that
we have a probability of about 38 % of obtaining a deviation from
the expected result at least as great as that actually observed.
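The arithmetic of the coin example can be retraced with a few lines of Python. This is only a sketch: the helper names `chi2_two_groups` and `tail_prob_one_df` are illustrative and do not come from the text, and the tail probability relies on the fact stated above that, for one d. of fr., $\chi^2$ is the square of a normal (0, 1) variable.

```python
import math


def chi2_two_groups(v, n, p):
    """Quantity (30.2.1): (v - n p)^2 / (n p q) for v occurrences in n trials."""
    q = 1.0 - p
    return (v - n * p) ** 2 / (n * p * q)


def tail_prob_one_df(x):
    """P(chi2 > x) for one d. of fr., i.e. P(|Z| > sqrt(x)) for a normal Z."""
    return math.erfc(math.sqrt(x / 2.0))


# Buffon's data: 2048 heads in 4040 throws, hypothesis p = 1/2.
buffon_chi2 = chi2_two_groups(2048, 4040, 0.5)
buffon_p = tail_prob_one_df(buffon_chi2)
print(round(buffon_chi2, 3), round(buffon_p, 2))
```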

Ex. 2. Suppose now that $k$ independent sets of observations are
available, and let these contain $n_1, \ldots, n_k$ observations respectively,
the corresponding numbers of occurrences of the event $E$ being
$\nu_1, \ldots, \nu_k$. The hypothesis of a constant probability equal to $p$ may
then be tested in various ways.

The totality of our data consists of $n = \sum n_i$ observations with
$\nu = \sum \nu_i$ occurrences, so that we obtain a first test by calculating the
quantity $\chi^2 = \frac{(\nu - np)^2}{npq}$. Further, the quantity $\chi_i^2 = \frac{(\nu_i - n_i p)^2}{n_i pq}$ provides
a separate test for the $i$:th set of observations.

Then $\chi_1^2, \ldots, \chi_k^2$ are independent, and for large $n_i$ all have
asymptotically the same distribution, viz. the $\chi^2$ distribution with one d. of
fr. By the addition theorem (18.1.7) the sum $\sum \chi_i^2$ has, in the limit,
a $\chi^2$ distribution with $k$ d. of fr., and this gives a joint test of all
our $\chi_i^2$ values.

Finally, when the $n_i$ are large, $\chi_1^2, \ldots, \chi_k^2$ may be regarded as a
sample of $k$ observed values of a variable with the fr. f. $k_1(x)$, and
we may apply the $\chi^2$ test to judge the deviation of the sample from
this hypothetical distribution.

In his classical experiments with peas, Mendel (Ref. 155) obtained
from 10 plants the numbers of green and yellow peas given in Table
30.2.1. According to Mendelian theory, the probability ought to be
$p = \frac{3}{4}$ for »yellow», and $q = \frac{1}{4}$ for »green» (the »3:1 hypothesis»).
The ten values of $\chi_i^2$, as well as the value $\chi^2 = 0.137$ for the totals,
all fall below the 5 % value for one d. of fr. The sum of all ten
$\chi_i^2$ is 7.191, and this falls below the 5 % value for ten d. of fr., which
by Table 3 is 18.307. Finally, the ten values of $\chi_i^2$ may be regarded
as a sample of ten values of a variable with the fr. f. $k_1(x)$. For this
distribution, we obtain from Table 3 the following probabilities:

$P(0 < \chi^2 < 0.148) = 0.3$,

$P(0.148 < \chi^2 < 1.074) = 0.4$,

$P(\chi^2 > 1.074) = 0.3$,

while according to the last column of Table 30.2.1 the corresponding
observed frequencies are respectively 2, 6 and 2. The calculation of
$\chi^2$ for this sample of $n = 10$ observations with $r = 3$ groups gives
$\chi^2 = (2 - 3)^2/3 + (6 - 4)^2/4 + (2 - 3)^2/3 = 1.667$. In this case, the
expected values are so small that the limiting distribution should not be
used, but we may compare the observed value $\chi^2 = 1.667$ with the values
$E(\chi^2) = 2$ and $D(\chi^2) = 1.902$ calculated from (30.1.1). Since the
observed value only differs from the mean by about 18 % of the s. d.,
the agreement must be regarded as good.

Table 30.2.1.

                      Number of peas
Plant       Yellow       Green              Total
number      $\nu_i$      $n_i - \nu_i$      $n_i$       $\chi_i^2$

   1          25           11                 36        0.593
   2          32            7                 39        1.034
   3          14            5                 19        0.018
   4          70           27                 97        0.416
   5          24           13                 37        2.027
   6          20            6                 26        0.051
   7          32           13                 45        0.363
   8          44            9                 53        1.818
   9          50           14                 64        0.333
  10          44           18                 62        0.538

Total        355          123                478        7.191

$\chi^2$ for the totals = 0.137


Thus all our tests imply that the data of Table 30.2.1 are
consistent with the 3:1 hypothesis. If either test had disclosed a
significant deviation, we should have had to reject the hypothesis, at least
until further experience had made it plausible that the deviation was
due to random fluctuations.
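The three tests of Ex. 2 can be retraced numerically as follows. This is a minimal sketch under stated assumptions: the counts are those read from Table 30.2.1, the class limits 0.148 and 1.074 are the Table 3 values quoted above, and all variable names are illustrative only.

```python
# Yellow and green counts per plant, as read from Table 30.2.1.
plants = [(25, 11), (32, 7), (14, 5), (70, 27), (24, 13),
          (20, 6), (32, 13), (44, 9), (50, 14), (44, 18)]
p = 0.75  # probability of "yellow" under the 3:1 hypothesis


def chi2_binomial(v, n, p):
    """(v - n p)^2 / (n p q) for v occurrences in n trials."""
    return (v - n * p) ** 2 / (n * p * (1.0 - p))


# Separate tests, one per plant, each with one d. of fr.
per_plant = [chi2_binomial(y, y + g, p) for y, g in plants]

# Joint test: the sum has k = 10 d. of fr. in the limit.
joint = sum(per_plant)

# Test on the totals, again with one d. of fr.
total_yellow = sum(y for y, _ in plants)
total_n = sum(y + g for y, g in plants)
pooled = chi2_binomial(total_yellow, total_n, p)

# The ten values treated as a sample from the one-d.-of-fr. distribution,
# grouped into three classes with probabilities 0.3, 0.4 and 0.3.
observed = [sum(x < 0.148 for x in per_plant),
            sum(0.148 <= x < 1.074 for x in per_plant),
            sum(x >= 1.074 for x in per_plant)]
expected = [3.0, 4.0, 3.0]
chi2_of_sample = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(round(joint, 2), round(pooled, 3), observed, round(chi2_of_sample, 3))
```

The unrounded sum of the ten values is about 7.19. The tabulated 7.191 is the sum of the entries after rounding to three decimals.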

Ex. 3. In another experiment, Mendel observed simultaneously the
shape and the colour of his peas. Among $n = 556$ peas he obtained:

Round and yellow       315   (expected 312.75),
Round and green        108   (expected 104.25),
Angular and yellow     101   (expected 104.25),
Angular and green       32   (expected  34.75),

where the expected numbers are calculated on the hypothesis that
the probabilities of the $r = 4$ groups are in the ratios 9 : 3 : 3 : 1. From
these numbers we find $\chi^2 = 0.470$. We have $r - 1 = 3$ d. of fr., and
by Table 3 the probability of a $\chi^2$ exceeding 0.470 lies between 90
and 95 %, so that the agreement is very good.
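The calculation of Ex. 3 may be sketched in Python as follows. The helper names are illustrative and not from the text. The tail probability uses the classical finite series for an odd number of d. of fr., which is an addition made here for checking purposes and takes the place of the look-up in Table 3.

```python
import math


def chi2_statistic(observed, probs):
    """Sum over groups of (v_i - n p_i)^2 / (n p_i)."""
    n = sum(observed)
    return sum((v - n * p) ** 2 / (n * p) for v, p in zip(observed, probs))


def tail_prob_odd_df(x, df):
    """P(chi2 > x) for an odd number df of degrees of freedom."""
    term = math.sqrt(x)  # x^(r - 1/2) / (1 * 3 * ... * (2r - 1)) for r = 1
    series = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        series += term
        term *= x / (2 * r + 1)
    normal_part = math.erfc(math.sqrt(x / 2.0))
    return normal_part + math.sqrt(2.0 / math.pi) * math.exp(-x / 2.0) * series


observed = [315, 108, 101, 32]           # Mendel's four groups
probs = [9 / 16, 3 / 16, 3 / 16, 1 / 16]  # the 9 : 3 : 3 : 1 hypothesis

mendel_chi2 = chi2_statistic(observed, probs)
mendel_p = tail_prob_odd_df(mendel_chi2, 3)  # r - 1 = 3 d. of fr.
print(round(mendel_chi2, 3), round(mendel_p, 2))
```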

Ex. 4. We finally consider an example where the hypothetical
distribution is of the continuous type. Aitken (Ref. 2, p. 49) gives the
following distributions of times shown by two samples of 500 watches
displayed in watch makers' windows (hour 0 means 0-1, etc.):


Table 30.2.2.

Hour         0    1    2    3    4    5    6    7    8    9   10   11    Total

Sample 1    41   34   54   39   49   45   41   33   37   41   47   39     500
Sample 2    36   47   41   47   49   45   32   37   40   41   37   48     500


On the hypothesis that the times are uniformly distributed over the
interval (0, 12), the expected number in each class would be
500/12 = 41.67, and hence we find $\chi_1^2 = 10.000$ for the first sample, and
$\chi_2^2 = 8.032$ for the second, while for the combined sample of all 1 000
watches we have $\chi^2 = 9.464$. In each case we have 12 - 1 = 11 d.
of fr., and by Table 3 the agreement is good. We may also consider
the sum $\chi_1^2 + \chi_2^2 = 18.032$, which has 22 d. of fr., and also shows a
good agreement.


30.3. The $\chi^2$ test when certain parameters are estimated from
the sample. The case of a completely specified hypothetical
distribution is rather exceptional in the applications. More often we
encounter cases where the hypothetical distribution contains a certain
number of unknown parameters, about the values of which we only
possess such information as may be derived from the sample itself.
We are then given a pr. f. $P(S; \alpha_1, \ldots, \alpha_s)$ containing $s$ unknown
parameters $\alpha_1, \ldots, \alpha_s$, but otherwise of known mathematical form.
The hypothesis $H$ to be tested will now be the hypothesis that our
sample has been drawn from a population having a distribution
determined by the pr. f. $P$, with some values of the parameters $\alpha_j$.

As in 30.1, we suppose that our sample is divided into $r$ groups,
corresponding to $r$ mutually exclusive sets $S_1, \ldots, S_r$, and we denote
the observed group frequencies by $\nu_1, \ldots, \nu_r$, while the corresponding
probabilities are $p_i(\alpha_1, \ldots, \alpha_s) = P(S_i; \alpha_1, \ldots, \alpha_s)$ for $i = 1, 2, \ldots, r$.

If the »true values» of the $\alpha_j$ were known, we should merely have
to calculate the quantity

$$\chi^2 = \sum_{i=1}^{r} \frac{[\nu_i - np_i(\alpha_1, \ldots, \alpha_s)]^2}{np_i(\alpha_1, \ldots, \alpha_s)}, \tag{30.3.1}$$

and apply the test described in 30.1, so that no further discussion
would be required.

In the actual case, however, the values of the $\alpha_j$ are unknown and
must be estimated from the sample. Now, if we replace in (30.3.1)
the unknown constants $\alpha_j$ by estimates calculated from the sample,
the $p_i$ will no longer be constants, but functions of the sample values,
and we are no longer entitled to apply the theorem of 30.1 on the
limiting distribution of $\chi^2$. As already pointed out in 26.4, there will
generally be an infinite number of different possible methods of
estimation of the $\alpha_j$, and it must be expected that the properties of the
sampling distribution of $\chi^2$ will more or less depend on the method
chosen.

The problem of finding the limiting distribution of $\chi^2$ under these
more complicated circumstances was first considered by R. A. Fisher
(Ref. 91, 95), who showed that in this case it is necessary to modify
the limiting distribution (30.1.2) due to K. Pearson. For an
important class of methods of estimation, the modification indicated by
Fisher is of a very simple kind. It is, in fact, only necessary to reduce
the number of d. of fr. of the limiting distribution (30.1.2) by one unit
for each parameter estimated from the sample.

We shall here choose one particularly important method of
estimation, and give a detailed deduction of the corresponding limiting
distribution of $\chi^2$. It will be shown in 33.4 that there is a whole
class of methods of estimation leading to the same limiting distribution.

It seems natural to attempt to determine the »best» values of the
parameters $\alpha_j$ so as to render $\chi^2$ defined by (30.3.1) as small as
possible. This is the $\chi^2$ minimum method of estimation. We then have
to solve the equations

$$-\frac{1}{2}\frac{\partial \chi^2}{\partial \alpha_j} = \sum_{i=1}^{r} \left( \frac{\nu_i - np_i}{p_i} + \frac{(\nu_i - np_i)^2}{2np_i^2} \right) \frac{\partial p_i}{\partial \alpha_j} = 0, \tag{30.3.2}$$

where $j = 1, 2, \ldots, s$, with respect to the unknowns $\alpha_1, \ldots, \alpha_s$, and
insert the values thus found into (30.3.1). The limiting distribution
of $\chi^2$ for this method of estimation has been investigated by Neyman
and E. S. Pearson (Ref. 170), who used methods of multi-dimensional
geometry of the type introduced by R. A. Fisher. We also refer in
this connection to a paper by Sheppard (Ref. 213).

Even in simple cases, the system (30.3.2) is often very difficult to
solve. It can, however, be shown that for large $n$ the influence of
the second term within the brackets becomes negligible. If, when
differentiating $\chi^2$ with respect to the $\alpha_j$, we simply regard the
denominators in the second member of (30.3.1) as constant, (30.3.2) is replaced
by the system

$$\sum_{i=1}^{r} \frac{\nu_i - np_i}{p_i} \frac{\partial p_i}{\partial \alpha_j} = 0 \qquad (j = 1, \ldots, s), \tag{30.3.3}$$

and usually this will be much easier to deal with. The method of
estimation which consists in determining the $\alpha_j$ from this system of
equations will be called the modified $\chi^2$ minimum method. Both methods
give, under general conditions, the same limiting distribution of $\chi^2$
for large $n$, but we shall here only consider the simpler method based
on (30.3.3).

By means of the condition a) of the theorem given below, the equations (30.3.3)
reduce to

$$\sum_{i=1}^{r} \frac{\nu_i}{p_i} \frac{\partial p_i}{\partial \alpha_j} = 0 \qquad (j = 1, \ldots, s), \tag{30.3.3 a}$$

which may also be written $\frac{\partial L}{\partial \alpha_j} = 0$, where $L = p_1^{\nu_1} p_2^{\nu_2} \cdots p_r^{\nu_r}$.
The method of estimation which consists in determining the $\alpha_j$ such that
$L$ becomes as large as possible is the maximum likelihood method introduced
by R. A. Fisher, which will be further discussed in Ch. 33. With respect to the
problem treated in the present paragraph, the modified $\chi^2$ minimum method is
thus identical with the maximum likelihood method. The latter method is,
however, applicable also to problems of a much more general character.
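The identity of the modified $\chi^2$ minimum method with the maximum likelihood method can be checked numerically in a small case. The sketch below is not from the text: the three-group model $p_1 = \theta^2$, $p_2 = 2\theta(1 - \theta)$, $p_3 = (1 - \theta)^2$ and the counts 30, 50, 20 are hypothetical and chosen only because the maximum likelihood estimate has the closed form $(2\nu_1 + \nu_2)/2n$. Equation (30.3.3) is solved by bisection and the root is compared with that closed form.

```python
def probs(theta):
    """Hypothetical one-parameter model with r = 3 groups (an assumption)."""
    return [theta ** 2, 2 * theta * (1 - theta), (1 - theta) ** 2]


def dprobs(theta):
    """Derivatives of the three group probabilities with respect to theta."""
    return [2 * theta, 2 - 4 * theta, -2 * (1 - theta)]


def modified_equation(theta, counts):
    """First member of (30.3.3) for s = 1."""
    n = sum(counts)
    return sum((v - n * p) / p * dp
               for v, p, dp in zip(counts, probs(theta), dprobs(theta)))


def solve_by_bisection(counts, lo=1e-6, hi=1 - 1e-6, steps=200):
    """Root of (30.3.3); the first member is positive near 0, negative near 1."""
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if modified_equation(mid, counts) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


counts = [30, 50, 20]  # hypothetical observed frequencies
n = sum(counts)

theta_modified = solve_by_bisection(counts)
theta_ml = (2 * counts[0] + counts[1]) / (2 * n)  # closed-form ML estimate

# chi2 of (30.3.1) at the estimate; r - s - 1 = 1 d. of fr. in the limit.
chi2 = sum((v - n * p) ** 2 / (n * p)
           for v, p in zip(counts, probs(theta_modified)))
print(theta_modified, theta_ml, round(chi2, 4))
```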

On account of the importance of the question, we shall now give
a deduction of the limiting distribution of $\chi^2$ under as general
conditions as possible, assuming that the parameters $\alpha_j$ are estimated by
the modified $\chi^2$ minimum method. We first give a detailed statement
of the theorem to be proved.

Suppose that we are given $r$ functions $p_1(\alpha_1, \ldots, \alpha_s), \ldots, p_r(\alpha_1, \ldots, \alpha_s)$
of $s < r$ variables $\alpha_1, \ldots, \alpha_s$ such that, for all points of a non-degenerate
interval $A$ in the $s$-dimensional space of the $\alpha_j$, the $p_i$ satisfy the following
conditions:

a) $\sum_{i=1}^{r} p_i(\alpha_1, \ldots, \alpha_s) = 1$.

b) $p_i(\alpha_1, \ldots, \alpha_s) > c^2 > 0$ for all $i$.

c) Every $p_i$ has continuous derivatives $\frac{\partial p_i}{\partial \alpha_j}$ and $\frac{\partial^2 p_i}{\partial \alpha_j \partial \alpha_k}$.

d) The matrix $D = \left\{ \frac{\partial p_i}{\partial \alpha_j} \right\}$, where $i = 1, \ldots, r$ and $j = 1, \ldots, s$, is
of rank $s$.

Let the possible results of a certain random experiment $\mathfrak{E}$ be divided
into $r$ mutually exclusive groups, and suppose that the probability of
obtaining a result belonging to the $i$:th group is $p_i^0 = p_i(\alpha_1^0, \ldots, \alpha_s^0)$,
where $\alpha_0 = (\alpha_1^0, \ldots, \alpha_s^0)$ is an inner point of the interval $A$. Let $\nu_i$ denote
the number of results belonging to the $i$:th group, which occur in a
sequence of $n$ repetitions of $\mathfrak{E}$, so that $\sum_{i=1}^{r} \nu_i = n$.

The equations (30.3.3) of the modified $\chi^2$ minimum method then have
exactly one system of solutions $\alpha = (\alpha_1, \ldots, \alpha_s)$ such that $\alpha$ converges in
probability to $\alpha_0$ as $n \to \infty$. The value of $\chi^2$ obtained by inserting these
values of the $\alpha_j$ into (30.3.1) is, in the limit as $n \to \infty$, distributed in a
$\chi^2$-distribution with $r - s - 1$ degrees of freedom.

The proof of this theorem is somewhat intricate, and will be
divided into two parts. In the first part (p. 427-431) it will be
shown that the equations (30.3.3) have exactly one solution $\alpha$ such
that $\alpha$ converges in probability (cf 20.3) to $\alpha_0$. In the second part
(p. 431-434) we consider the variables

$$y_i = \frac{\nu_i - np_i(\alpha_1, \ldots, \alpha_s)}{\sqrt{np_i(\alpha_1, \ldots, \alpha_s)}} \qquad (i = 1, \ldots, r), \tag{30.3.4}$$

where $\alpha = (\alpha_1, \ldots, \alpha_s)$ is the solution of (30.3.3), the existence of
which has just been established. It will be shown here that, as
$n \to \infty$, the joint distribution of the $y_i$ tends to a certain singular
normal distribution of a type similar to the limiting distribution of
the variables $x_i$ defined by (30.1.3). As in the corresponding proof in
30.1, the limiting distribution of $\chi^2 = \sum_{i=1}^{r} y_i^2$ is then directly obtained
from 24.5.


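Throughout the proof, the subscript $i$ will assume the values
$1, 2, \ldots, r$, while $j$ and $k$ assume the values $1, 2, \ldots, s$.

We shall first introduce certain matrix notations, and transform
the equations (30.3.3) into matrix form. Denoting by $\left( \frac{\partial p_i}{\partial \alpha_j} \right)_0$ the value
assumed by $\frac{\partial p_i}{\partial \alpha_j}$ in the point $\alpha_0$, (30.3.3) may be written

$$\sum_k (\alpha_k - \alpha_k^0) \sum_i \frac{1}{p_i^0} \left( \frac{\partial p_i}{\partial \alpha_j} \right)_0 \left( \frac{\partial p_i}{\partial \alpha_k} \right)_0 = \sum_i \frac{\nu_i - np_i^0}{np_i^0} \left( \frac{\partial p_i}{\partial \alpha_j} \right)_0 + \omega_j(\alpha), \tag{30.3.5}$$

where

$$\omega_j(\alpha) = \sum_i \frac{\nu_i - np_i^0}{n} \left[ \frac{1}{p_i} \frac{\partial p_i}{\partial \alpha_j} - \frac{1}{p_i^0} \left( \frac{\partial p_i}{\partial \alpha_j} \right)_0 \right] - \sum_i (p_i - p_i^0) \left[ \frac{1}{p_i} \frac{\partial p_i}{\partial \alpha_j} - \frac{1}{p_i^0} \left( \frac{\partial p_i}{\partial \alpha_j} \right)_0 \right] - \sum_i \frac{1}{p_i^0} \left( \frac{\partial p_i}{\partial \alpha_j} \right)_0 \left[ p_i - p_i^0 - \sum_k \left( \frac{\partial p_i}{\partial \alpha_k} \right)_0 (\alpha_k - \alpha_k^0) \right]. \tag{30.3.6}$$

Let us denote by $B$ the matrix of order $r \cdot s$

$$B = \left\{ \frac{1}{\sqrt{p_i^0}} \left( \frac{\partial p_i}{\partial \alpha_j} \right)_0 \right\} \qquad (i = 1, \ldots, r; \; j = 1, \ldots, s).$$

By 11.1, we have $B = P_0 D_0$, where $P_0$ is the diagonal matrix formed
by the diagonal elements $\frac{1}{\sqrt{p_1^0}}, \ldots, \frac{1}{\sqrt{p_r^0}}$, while $D_0$ is the matrix
obtained by taking $\alpha_j = \alpha_j^0$ in the matrix $D = \left\{ \frac{\partial p_i}{\partial \alpha_j} \right\}$. Hence by condition
d) the matrix $B$ is of rank $s$ (cf 11.6). We further write in analogy
with (30.1.3)

$$x_i = \frac{\nu_i - np_i^0}{\sqrt{np_i^0}}, \tag{30.3.7}$$

and denote by $\alpha$, $\alpha_0$, $\omega(\alpha)$ and $x$ the column vectors (cf 11.2)

$\alpha = (\alpha_1, \ldots, \alpha_s)$,

$\alpha_0 = (\alpha_1^0, \ldots, \alpha_s^0)$,

$\omega(\alpha) = (\omega_1(\alpha), \ldots, \omega_s(\alpha))$,

$x = (x_1, \ldots, x_r)$,

the three first of which are, as matrices, of order $s \cdot 1$, while the
fourth is of order $r \cdot 1$.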


In matrix notation, the system of equations (30.3.5), where
$j = 1, \ldots, s$, may now be written (cf 11.3)

$$B'B(\alpha - \alpha_0) = n^{-1/2} B'x + \omega(\alpha).$$

$B'B$ is a symmetric matrix of order $s \cdot s$, which according to 11.9 is
non-singular, so that the reciprocal $(B'B)^{-1}$ exists (cf 11.7) *), and we
obtain

$$\alpha = \alpha_0 + n^{-1/2} (B'B)^{-1} B'x + (B'B)^{-1} \omega(\alpha). \tag{30.3.8}$$

This matrix equation is thus equivalent to the fundamental system of
equations (30.3.3).

*) Note that we cannot write here $(B'B)^{-1} = B^{-1}(B')^{-1}$, since by hypothesis
$s < r$, so that $B$ is not square, and the reciprocal $B^{-1}$ is undefined. If we take
$s = r$, it will be seen that the conditions a) and d) of the theorem are incompatible.
In this case, if we assume that a)-c) are satisfied, the matrices $D$, $B$ and $B'B$ are
all singular, so that the reciprocal $(B'B)^{-1}$ is undefined, and (30.3.8) has no sense.

For every fixed $i$ the random variable $\nu_i$ has the mean $np_i^0$ and
the s. d. $\sqrt{np_i^0(1 - p_i^0)}$, so that by the Bienaymé-Tchebycheff inequality
(15.7.2) the probability of the relation $|\nu_i - np_i^0| \geq \lambda\sqrt{n}$ is at most
equal to $\frac{p_i^0(1 - p_i^0)}{\lambda^2} < \frac{p_i^0}{\lambda^2}$. Consequently the probability that we have
$|\nu_i - np_i^0| \geq \lambda\sqrt{n}$ for at least one value of $i$ is smaller than
$\sum_i \frac{p_i^0}{\lambda^2} = \frac{1}{\lambda^2}$ and, conversely, with a probability greater than $1 - \lambda^{-2}$
we have

$$|\nu_i - np_i^0| < \lambda\sqrt{n} \quad \text{for all } i = 1, \ldots, r. \tag{30.3.9}$$

Until further notice, we shall now assume that the $\nu_i$ satisfy the
relations (30.3.9). We shall here allow $\lambda$ to denote a function of $n$
such that $\lambda$ tends to infinity with $n$, while $\lambda^2/\sqrt{n}$ tends to zero. We
may e. g. take $\lambda = n^q$, where $0 < q < \frac{1}{4}$. All results obtained under
such assumptions will thus be true with a probability which is greater
than $1 - \lambda^{-2}$, and which consequently tends to 1 as $n \to \infty$.

From (30.3.7) we then obtain by condition b)

$$|x_i| < \frac{\lambda}{c}. \tag{30.3.10}$$

Further, when $\alpha' = (\alpha_1', \ldots, \alpha_s')$ and $\alpha'' = (\alpha_1'', \ldots, \alpha_s'')$ are any points in
the interval $A$, we obtain from (30.3.6) after some calculations, using
the conditions b) and c), and expanding in Taylor's series,

$$|\omega_j(\alpha') - \omega_j(\alpha'')| \leq K_1 |\alpha' - \alpha''| \left( |\alpha' - \alpha_0| + |\alpha'' - \alpha_0| + \frac{\lambda}{\sqrt{n}} \right). \tag{30.3.11}$$

In the second member, we use the notation $|a - b|$ for the distance
(cf 3.1) between two points $a$ and $b$ in the $s$-dimensional space of the
$\alpha_j$, while $K_1$ is a constant independent of $\alpha'$, $\alpha''$, $j$ and $n$.

We now define a sequence of vectors $\alpha_\nu = (\alpha_1^{(\nu)}, \ldots, \alpha_s^{(\nu)})$ by writing
for $\nu = 1, 2, \ldots$

$$\alpha_\nu = \alpha_0 + n^{-1/2} (B'B)^{-1} B'x + (B'B)^{-1} \omega(\alpha_{\nu - 1}), \tag{30.3.12}$$

and we propose to show that the sequence $\alpha_1, \alpha_2, \ldots$ converges to
a definite limit $\alpha$, which is then evidently a solution of (30.3.8). By
(30.3.6) we have $\omega(\alpha_0) = 0$, and thus

$$\alpha_1 - \alpha_0 = n^{-1/2} (B'B)^{-1} B'x, \tag{30.3.13}$$

while for $\nu > 0$

$$\alpha_{\nu + 1} - \alpha_\nu = (B'B)^{-1} [\omega(\alpha_\nu) - \omega(\alpha_{\nu - 1})]. \tag{30.3.14}$$

Now the matrices $(B'B)^{-1} B'$ and $(B'B)^{-1}$ are both independent of $n$.
Denoting by $g$ an upper bound of the absolute values of the elements
of these two matrices, it then in the first place follows from (30.3.13)
and (30.3.10) that every element of the vector $\alpha_1 - \alpha_0$ satisfies the
inequality

$$|\alpha_j^{(1)} - \alpha_j^0| < \frac{rg}{c} \cdot \frac{\lambda}{\sqrt{n}},$$

so that

$$|\alpha_1 - \alpha_0| < K_2 \frac{\lambda}{\sqrt{n}},$$

where $K_2$ is independent of $n$. In a similar way, it then follows from
(30.3.14) and (30.3.11) that we have

$$|\alpha_{\nu + 1} - \alpha_\nu| \leq K_3 |\alpha_\nu - \alpha_{\nu - 1}| \left( |\alpha_\nu - \alpha_0| + |\alpha_{\nu - 1} - \alpha_0| + \frac{\lambda}{\sqrt{n}} \right)$$

for every $\nu > 0$, where $K_3$ is independent of $\nu$ and $n$. From the two
last inequalities, it now follows by induction that we have for all
sufficiently large $n$, and for all $\nu = 0, 1, 2, \ldots$

$$|\alpha_{\nu + 1} - \alpha_\nu| \leq K_2 [(4K_2 + 1) K_3]^\nu \left( \frac{\lambda}{\sqrt{n}} \right)^{\nu + 1}. \tag{30.3.15}$$

Since by hypothesis $\alpha_0$ is an inner point of the interval $A$, it follows
that for all sufficiently large $n$ the vectors $\alpha_1, \alpha_2, \ldots$ (considered as
points in the $\alpha$-space) all belong to $A$, and that the sequence $\alpha_1, \alpha_2, \ldots$
converges to a definite limit

$$\alpha = \alpha_0 + (\alpha_1 - \alpha_0) + (\alpha_2 - \alpha_1) + \cdots \tag{30.3.16}$$

which, as already observed, is a solution of (30.3.8), and thus also of
the fundamental equations (30.3.3). It follows from (30.3.15) that
$\alpha \to \alpha_0$ as $n \to \infty$. Moreover, $\alpha$ is the only solution of (30.3.8) tending
to $\alpha_0$ as $n \to \infty$. In fact, if $\alpha'$ is another solution tending to $\alpha_0$, we
have

$$\alpha - \alpha' = (B'B)^{-1} (\omega(\alpha) - \omega(\alpha')),$$

and by the same argument as above it follows that

$$|\alpha - \alpha'| \leq K_4 |\alpha - \alpha'| \left( |\alpha - \alpha_0| + |\alpha' - \alpha_0| + \frac{\lambda}{\sqrt{n}} \right),$$

where the expression within the brackets tends to zero as $n \to \infty$.
But this is evidently only possible if $\alpha' = \alpha$ for all sufficiently large $n$.

All this has been proved under the assumption that the relations
(30.3.9) are satisfied, and thus holds with a probability which is
greater than $1 - \lambda^{-2}$ and consequently tends to 1 as $n \to \infty$. We
have thus established the existence of exactly one solution of (30.3.8),
or (30.3.3), which converges in probability to $\alpha_0$, and the first part
of the proof is completed.

Still assuming that the relations (30.3.9) are satisfied, we obtain
from (30.3.8), (30.3.13) and (30.3.16)

$$(B'B)^{-1} \omega(\alpha) = \alpha - \alpha_1 = (\alpha_2 - \alpha_1) + (\alpha_3 - \alpha_2) + \cdots$$

It then follows from (30.3.15) that every component of the vector
$(B'B)^{-1} \omega(\alpha)$ is smaller than $K_5 \lambda^2 / n$, where $K_5$ is independent of $n$,
so that (30.3.8) may be written

$$\alpha - \alpha_0 = n^{-1/2} (B'B)^{-1} B'x + \frac{K_5 \lambda^2}{n} \theta_1, \tag{30.3.17}$$

where $\theta_1 = (\theta_1', \ldots, \theta_s')$ denotes a column vector such that $|\theta_j'| \leq 1$ for
$j = 1, \ldots, s$.

Consider now the variables $y_i$ defined by (30.3.4). Still assuming
that the relations (30.3.9) are satisfied, we obtain by means of (30.3.7),
(30.3.10) and (30.3.17)

$$y_i = \frac{\nu_i - np_i}{\sqrt{np_i}} = \left( x_i - \sqrt{n}\, \frac{p_i - p_i^0}{\sqrt{p_i^0}} \right) \sqrt{\frac{p_i^0}{p_i}} = x_i - \sqrt{n} \sum_j \frac{1}{\sqrt{p_i^0}} \left( \frac{\partial p_i}{\partial \alpha_j} \right)_0 (\alpha_j - \alpha_j^0) + O\!\left( \frac{\lambda^2}{\sqrt{n}} \right).$$

Expressing this relation in matrix notation, we obtain

$$y = x - \sqrt{n}\, B(\alpha - \alpha_0) + \frac{K_6 \lambda^2}{\sqrt{n}} \theta_2,$$

where $y = (y_1, \ldots, y_r)$ and $\theta_2 = (\theta_1'', \ldots, \theta_r'')$ with $|\theta_i''| \leq 1$, while $K_6$ is
independent of $n$. Substituting here the expression (30.3.17) for $\alpha - \alpha_0$,
we obtain

$$y = x - B(B'B)^{-1} B'x + \frac{K \lambda^2}{\sqrt{n}} \theta = [I - B(B'B)^{-1} B']\, x + \frac{K \lambda^2}{\sqrt{n}} \theta, \tag{30.3.18}$$

where $I$ is the unit matrix of order $r \cdot r$, and $\theta = (\theta_1, \ldots, \theta_r)$ with
$|\theta_i| \leq 1$, while $K$ is independent of $n$.

We now drop the assumption that the relations (30.3.9) are
satisfied, and define a vector $z = (z_1, \ldots, z_r)$ by writing

$$y = Ax + z,$$

where $A$ denotes the symmetric matrix

$$A = I - B(B'B)^{-1} B'.$$

It then follows from (30.3.18) that, with a probability greater than
$1 - \lambda^{-2}$, we have $|z_i| \leq K \lambda^2 / \sqrt{n}$ for all $i$, so that $z$ converges in
probability to zero. Further, it has been shown in 30.1 that the
variables $x_1, \ldots, x_r$ are, in the limit as $n \to \infty$, normally distributed
with zero means and the moment matrix $\Lambda = I - pp'$, where
$p = (\sqrt{p_1^0}, \ldots, \sqrt{p_r^0})$. By the last proposition of 22.6 it then follows
that the limiting distribution of $y$ is obtained by the linear
transformation $y = Ax$, where $x = (x_1, \ldots, x_r)$ has its normal limiting
distribution, with the moment matrix $\Lambda$ of rank $r - 1$.

By 24.4, the joint limiting distribution of $y_1, \ldots, y_r$ is thus
normal, with zero means and the moment matrix

$$A \Lambda A' = [I - B(B'B)^{-1} B']\, [I - pp']\, [I - B(B'B)^{-1} B'].$$

Now by condition a) the $j$:th element of the vector $B'p$ is

$$\sum_i \left( \frac{\partial p_i}{\partial \alpha_j} \right)_0 = \left( \frac{\partial}{\partial \alpha_j} \sum_i p_i \right)_0 = 0,$$

so that $B'p$ is identically zero. Hence we find on multiplication that
the moment matrix of the limiting $y$-distribution reduces to

$$A \Lambda A' = I - pp' - B(B'B)^{-1} B'. \tag{30.3.19}$$

It now only remains to show that this symmetric matrix of order
$r \cdot r$ has $r - s - 1$ characteristic numbers equal to 1, while the rest
are 0, so that the effect of the last term is to reduce the rank of
the matrix by $s$ units. It then follows from 24.5 that the sum of squares

$$\sum_{i=1}^{r} y_i^2 = \sum_{i=1}^{r} \frac{(\nu_i - np_i)^2}{np_i} = \chi^2$$

is, in the limit, distributed in a $\chi^2$-distribution with $r - s - 1$ degrees
of freedom, so that our theorem will be proved.

For this purpose we first observe that, by 11.9, the $s$ characteristic
numbers $\kappa_j$ of the symmetric matrix $B'B$ are all positive. Writing
$\kappa_j = \mu_j^2$, where $\mu_j > 0$, and denoting by $M$ the diagonal matrix formed
by the diagonal elements $\mu_1, \ldots, \mu_s$, we may thus by 11.9 find an
orthogonal matrix $C$ of order $s \cdot s$ such that $C'B'BC = M^2$, and
hence $(B'B)^{-1} = (CM^2C')^{-1} = CM^{-1} \cdot M^{-1}C'$. It follows that

$$B(B'B)^{-1} B' = BCM^{-1} \cdot M^{-1} C'B' = HH', \tag{30.3.20}$$

where $H = BCM^{-1}$ is a matrix of order $r \cdot s$ such that

$$H'H = M^{-1} C'B'BC M^{-1} = M^{-1} M^2 M^{-1} = I,$$

denoting here by $I$ the unit matrix of order $s \cdot s$. The last relation
signifies that the $s$ columns of the matrix $H$ satisfy the orthogonality
relations (11.9.2). Further, we have shown above that $B'p = 0$, and
hence $H'p = M^{-1} C'B'p = 0$. Thus if we complete the matrix $H$ by
an additional column with the elements $\sqrt{p_1^0}, \ldots, \sqrt{p_r^0}$, the $s + 1$
columns of the new matrix $H_1$ will still satisfy the orthogonality
relations. Since $s < r$, we may then by 11.9 find an orthogonal matrix
$K$ of order $r \cdot r$, the $s + 1$ last columns of which are identical with
the matrix $H_1$.

Then $K'p$ is a matrix of order $r \cdot 1$, i. e. a column vector, and it
follows from the multiplication rule that we have $K'p = (0, \ldots, 0, 1)$.
Thus the product $K'pp'K = (K'p)(K'p)'$ is a matrix of
order $r \cdot r$, all elements of which are zero, except the last element of
the main diagonal, which is equal to one. In a similar way it is
seen that the product $K'HH'K$ is a matrix of order $r \cdot r$, all elements
of which are zero, except the $s$ diagonal elements immediately preceding
the last, which are all equal to one.

By (30.3.20), the moment matrix (30.3.19) now takes the form
$I - pp' - HH'$. It follows from the above that the transformed
matrix $K'(I - pp' - HH')K$ is a diagonal matrix, the $r - s - 1$ first
diagonal elements of which are equal to 1, while the rest are 0. Thus
we have proved our assertion about the characteristic numbers of the
moment matrix (30.3.19). As observed above, this completes the proof
of the theorem.
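The assertion about the characteristic numbers of the moment matrix (30.3.19) can be checked numerically in a small case. The following sketch is not part of the proof. It assumes a hypothetical model with $r = 3$ groups and $s = 1$ parameter, $p_1 = \theta^2$, $p_2 = 2\theta(1 - \theta)$, $p_3 = (1 - \theta)^2$, taken at the arbitrary point $\theta_0 = 0.3$. A symmetric matrix whose square equals itself has only the characteristic numbers 0 and 1, and its trace then counts those equal to 1, so the sketch verifies these two properties instead of computing the characteristic numbers directly.

```python
import math


def limiting_moment_matrix(p0, dp0):
    """I - p p' - B (B'B)^{-1} B' of (30.3.19) for one parameter (s = 1)."""
    r = len(p0)
    root = [math.sqrt(p) for p in p0]        # the vector p of the text
    b = [d / s for d, s in zip(dp0, root)]   # the single column of B
    bb = sum(x * x for x in b)               # B'B, a scalar when s = 1
    return [[(1.0 if i == j else 0.0) - root[i] * root[j] - b[i] * b[j] / bb
             for j in range(r)] for i in range(r)]


def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


theta0 = 0.3  # hypothetical true parameter value
p0 = [theta0 ** 2, 2 * theta0 * (1 - theta0), (1 - theta0) ** 2]
dp0 = [2 * theta0, 2 - 4 * theta0, -2 * (1 - theta0)]

Q = limiting_moment_matrix(p0, dp0)
Q2 = matmul(Q, Q)

trace = sum(Q[i][i] for i in range(3))
max_dev = max(abs(Q2[i][j] - Q[i][j]) for i in range(3) for j in range(3))
print(round(trace, 12), max_dev < 1e-12)
```

Here the trace is $r - s - 1 = 1$, in agreement with the number of degrees of freedom stated in the theorem.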

By means of this theorem, we can now introduce a test of the
hypothesis $H$ in exactly the same way as in the simpler case
considered in 30.1. Some examples of the application of this test will
be shown in the following paragraph.

30.4. Examples. We shall here apply the $\chi^2$ test to two
particularly important cases, viz. the Poisson and the normal distribution.
Other simple distributions may be treated in a similar way.

Ex. 1. The Poisson distribution. Suppose that it is required to test
the hypothesis that a given sample of $n$ values $x_1, \ldots, x_n$ is drawn
from some Poisson distribution, with an unknown value of the
parameter $\lambda$. Every $x_\mu$ is equal to some non-negative integer $i$, and we
arrange the $x_\mu$ according to their values into $r$ groups, pooling the
data for the smallest and the largest values of $i$, where the
observations are few. Suppose that we obtain in this way

$\nu_k$ observations with $x \leq k$,

$\nu_i$ observations with $x = i$, where $i = k + 1, \ldots, k + r - 2$,

$\nu_{k+r-1}$ observations with $x \geq k + r - 1$.

If we write $\varpi_i = P(x = i) = \frac{\lambda^i}{i!} e^{-\lambda}$, the corresponding probabilities
are

$$p_k = P(x \leq k) = \sum_{i=0}^{k} \varpi_i,$$

$$p_i = P(x = i) = \varpi_i \quad \text{for } i = k + 1, \ldots, k + r - 2,$$

$$p_{k+r-1} = P(x \geq k + r - 1) = \sum_{i=k+r-1}^{\infty} \varpi_i.$$

In order to estimate the unknown parameter $\lambda$ by the modified $\chi^2$
minimum method, we have to solve the system (30.3.3), or the
equivalent system (30.3.3 a). Since there is only one unknown parameter,
we have $s = 1$, so that each system reduces to one single equation,
and (30.3.3 a) gives

$$\nu_k \frac{\sum_{0}^{k} \left( \frac{i}{\lambda} - 1 \right) \varpi_i}{\sum_{0}^{k} \varpi_i} + \sum_{k+1}^{k+r-2} \nu_i \left( \frac{i}{\lambda} - 1 \right) + \nu_{k+r-1} \frac{\sum_{k+r-1}^{\infty} \left( \frac{i}{\lambda} - 1 \right) \varpi_i}{\sum_{k+r-1}^{\infty} \varpi_i} = 0.$$

This equation has a single root $\lambda = \lambda^*$, where

$$\lambda^* = \frac{1}{n} \left[ \nu_k \frac{\sum_{0}^{k} i \varpi_i}{\sum_{0}^{k} \varpi_i} + \sum_{k+1}^{k+r-2} i \nu_i + \nu_{k+r-1} \frac{\sum_{k+r-1}^{\infty} i \varpi_i}{\sum_{k+r-1}^{\infty} \varpi_i} \right].$$

Here, the second term within the brackets is equal to the sum of all
$x_\mu$ such that $k < x_\mu < k + r - 1$, while the first and the last term
give approximately the sum of all $x_\mu$ which are $\leq k$ or $\geq k + r - 1$
respectively. The estimate $\lambda^*$ to be used for $\lambda$ is thus approximately
equal to the arithmetic mean of the sample values:

$$\lambda^* = \frac{1}{n} \sum_{\mu=1}^{n} x_\mu = \bar{x}.$$

Taking $s = 1$ in the theorem of the preceding paragraph, we find that
the limiting $\chi^2$ distribution has in this case $r - 2$ d. of fr.

In Table 30.4.1, three numerical examples of the application of
the test are shown. Ex. 1 a) gives the numbers of $\alpha$-particles radiated
from a disc in 2608 periods of 7.5 seconds according to Rutherford
and Geiger (Ref. 2, p. 77). Ex. 1 b) gives the numbers of red blood
corpuscles in the 169 compartments of a haemacytometer observed
by N. G. Holmberg. Ex. 1 c) gives the numbers of flowers of 200
plants of Primula veris counted by M.-L. Cramér at Utö in 1928.
According to the rule given in 30.1, the tail groups of each sample
have been pooled so that every group contains at least 10 expected
observations. Thus e. g. in 1 b) the observed frequencies in the groups
$i \leq 7$ and $i \geq 17$ are respectively 1 + 3 + 5 + 8 = 17 and
6 + 3 + 2 + 2 + 1 = 14. The agreement is good in a), and even very
good in b), while in c) we find an »almost significant» deviation from
the hypothesis of a Poisson distribution, which is mainly due to the
excessive number of plants with eight flowers.

Table 30.4.1.

Application of the $\chi^2$ test to the Poisson distribution.

Ex. 1 a) Number $\nu_i$ of periods with $i$ $\alpha$-particles.

  $i$        $\nu_i$       $np_i$        $(\nu_i - np_i)^2 / np_i$

   0           57          54.399         0.1244
   1          203         210.523         0.2688
   2          383         407.361         1.4568
   3          525         525.496         0.0005
   4          532         508.418         1.0938
   5          408         393.516         0.5332
   6          273         253.817         1.4498
   7          139         140.325         0.0125
   8           45          67.882         7.7132
   9           27          29.189         0.1642
 ≥ 10          16          17.075         0.0677

 Total       2608        2608.000        12.8849

 $\lambda^* = 3.870$, $\chi^2 = 12.885$ (9 d. of fr.), $P = 0.17$.
 The last group pools the observed frequencies 10 + 4 + 2 = 16.

Ex. 1 b) Number $\nu_i$ of compartments with $i$ blood corpuscles.

  $i$        $\nu_i$       $np_i$        $(\nu_i - np_i)^2 / np_i$

 ≤ 7           17          15.7955        0.0919
   8           13          11.4043        0.2233
   9           14          15.0930        0.0792
  10           15          17.9773        0.4931
  11           15          19.4661        1.0247
  12           21          19.3217        0.1458
  13           18          17.7032        0.0050
  14           17          15.0616        0.2495
  15           16          11.9599        1.3648
  16            9           8.9034        0.0010
 ≥ 17          14          16.3140        0.3282

 Total        169         169.0000        4.0065

 $\lambda^* = 11.911$, $\chi^2 = 4.006$ (9 d. of fr.), $P = 0.91$.

Ex. 1 c) Number $\nu_i$ of plants with $i$ flowers.

  $i$        $\nu_i$       $np_i$        $(\nu_i - np_i)^2 / np_i$

 ≤ 5           17          25.0217        2.5717
   6           19          19.1360        0.0010
   7           20          24.1934        0.7268
   8           42          26.7639        8.6736
   9           27          26.3178        0.0177
  10           25          23.2913        0.1254
  11           23          18.7389        0.9689
  12           11          13.8199        0.5754
 ≥ 13          16          22.7171        1.9861

 Total        200         200.0000       15.6466

 $\lambda^* = 8.850$, $\chi^2 = 15.647$ (7 d. of fr.), $P = 0.03$.
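The calculation for Ex. 1 a) may be retraced with the following Python sketch. It takes the observed frequencies as read from Table 30.4.1 (with the tail $i \geq 10$ pooled) and the estimate $\lambda^* = 3.870$ given there. The variable and function names are illustrative only, and the tail probability uses the classical finite series for an odd number of d. of fr., which takes the place of the look-up in Table 3.

```python
import math

# Observed numbers of periods with i = 0, 1, ..., 9 particles; the last
# entry is the pooled group i >= 10 (as read from Table 30.4.1, Ex. 1 a).
observed = [57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 16]
lam = 3.870  # estimate of the Poisson parameter quoted in the table
n = sum(observed)


def poisson_prob(i, lam):
    """P(x = i) for a Poisson variable with parameter lam."""
    return math.exp(-lam) * lam ** i / math.factorial(i)


def tail_prob_odd_df(x, df):
    """P(chi2 > x) for an odd number df of degrees of freedom."""
    term = math.sqrt(x)
    series = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        series += term
        term *= x / (2 * r + 1)
    normal_part = math.erfc(math.sqrt(x / 2.0))
    return normal_part + math.sqrt(2.0 / math.pi) * math.exp(-x / 2.0) * series


# Expected frequencies; the pooled tail takes the remaining probability.
expected = [n * poisson_prob(i, lam) for i in range(len(observed) - 1)]
expected.append(n - sum(expected))

chi2 = sum((v - e) ** 2 / e for v, e in zip(observed, expected))
df = len(observed) - 1 - 1  # r - s - 1 with s = 1 estimated parameter
p_value = tail_prob_odd_df(chi2, df)
print(round(chi2, 3), df, round(p_value, 2))
```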

The cases considered above are representative of classes of variables which often
agree well with the Poisson distribution. When the data show a significant
deviation from the Poisson distribution, the agreement may sometimes be considerably
improved by introducing the hypothesis that the parameter $\lambda$ itself is a random
variable, distributed in a Pearson type III distribution with the fr. f.
$\frac{a^\kappa}{\Gamma(\kappa)} x^{\kappa - 1} e^{-ax}$, $(x > 0)$, where $a$ and $\kappa$ are positive parameters. In this way
we obtain the negative binomial distribution (cf Ex. 21, p. 269), which has
interesting applications e.g. to accident and sickness statistics (Greenwood and
Yule, Ref. 119, Eggenberger, Ref. 81, Newbold, Ref. 169 a), and to problems
connected with the number of individuals belonging to given species in samples
from plant or animal populations (Eneroth, Ref. 81 a; Fisher, Corbet and
Williams, Ref. 111). In the case of accident data, the introduction of a variable
$\lambda$ may be interpreted as a way of taking account of the variation of risk among
the members of a given population. Analogous interpretations may be advanced
in other cases. The subject may also be considered from the point of view of
random processes (cf Lundberg, Ref. 162).

Ex. 2. The normal distribution. Let a sample of n values X \ y . . .. x n 
be grouped into r classes, the i:th class containing- v\ observations 
situated in the interval $(\xi_i - \tfrac12 h,\ \xi_i + \tfrac12 h)$, where
$\xi_i = \xi_1 + (i-1)h$.

We want to test the hypothesis that the sample has been drawn from
some normal population, with unknown values of the parameters $m$
and $\sigma$. If the hypothesis is true, the probability $p_i$ corresponding to
the i:th class is

$$p_i = \frac{1}{\sigma\sqrt{2\pi}} \int e^{-\frac{(x-m)^2}{2\sigma^2}}\,dx,$$

where the integral is extended over the i:th class interval. For
the two extreme classes (i = 1 and i = r), the intervals should be
$(-\infty,\ \xi_1 + \tfrac12 h)$ and $(\xi_r - \tfrac12 h,\ +\infty)$ respectively. We then
have, writing for brevity $g(x) = e^{-\frac{(x-m)^2}{2\sigma^2}}$,

$$\frac{\partial p_i}{\partial m} = \frac{1}{\sigma^3\sqrt{2\pi}} \int (x-m)\,g(x)\,dx,$$

$$\frac{\partial p_i}{\partial \sigma} = \frac{1}{\sigma^4\sqrt{2\pi}} \int (x-m)^2 g(x)\,dx - \frac{1}{\sigma^2\sqrt{2\pi}} \int g(x)\,dx.$$

The equations (30.3.3 a) then give after some simple reductions, all
integrals being extended over the respective class intervals specified above,

$$m = \frac{1}{n}\sum_i \nu_i\,\frac{\int x\,g(x)\,dx}{\int g(x)\,dx}, \qquad \sigma^2 = \frac{1}{n}\sum_i \nu_i\,\frac{\int (x-m)^2 g(x)\,dx}{\int g(x)\,dx}.$$

We first assume that the grouping has been arranged such that the
two extreme classes do not contain any observed values. We then have
$\nu_1 = \nu_r = 0$. For small values of $h$, an approximate solution may be
obtained simply by replacing the functions under the integrals by
their values in the mid-point $\xi_i$ of the corresponding class interval.
In this way we obtain estimates $m^*$ and $\sigma^*$ given by the expressions

$$m^* = \frac{1}{n}\sum_i \nu_i \xi_i, \qquad \sigma^{*2} = \frac{1}{n}\sum_i \nu_i (\xi_i - m^*)^2.$$

Thus $m^*$ and $\sigma^{*2}$ are identical with the mean $\bar x$ and the variance $s^2$
of the grouped sample, calculated according to the usual rule (cf 27.9)
that all sample values in a certain class are placed in the mid-point
of the class interval. In order to obtain a closer approximation,
we may develop the functions under the integrals in Taylor's series
about the mid-point $\xi_i$. For small $h$, we then find by some calculation
that the above formulae should be amended as follows:

$$m^* = \frac{1}{n}\sum_i \nu_i \xi_i + O(h^4), \qquad \sigma^{*2} = \frac{1}{n}\sum_i \nu_i (\xi_i - m^*)^2 - \frac{h^2}{12} + O(h^4).$$

Neglecting terms of order $h^4$ we may thus use the mean of the
grouped sample as our estimate of $m$, while Sheppard's correction (cf
27.9) should be applied to the variance.

Even when $h$ is not very small, and when the extreme classes are
not actually empty, but contain only a small part of the total sample,
the same procedure will lead to a reasonable approximation. In
practice, it is advisable to pool the extreme classes of a given sample
according to the rule given in 30.1, so that every class contains at
least 10 expected observations. Our estimates of $m$ and $\sigma^2$ should
then if possible be the values of $\bar x$ and $s^2$ calculated from the original
grouping, before any pooling has taken place, and with Sheppard's
correction applied to $s^2$. If $r$ is the number of classes after the pooling,
and actually used for the calculation of $\chi^2$, the limiting distribution
of $\chi^2$ has $r - 3$ d. of fr., since we have determined two parameters
from the sample.
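
The fitting recipe described above can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the class mid-points are equally spaced with width h, Sheppard's correction is taken as the subtraction of $h^2/12$ from the grouped variance, and the two extreme classes are treated as open-ended when the expected frequencies are computed.

```python
import math

def grouped_normal_estimates(midpoints, counts, h):
    """Mean and Sheppard-corrected variance of a grouped sample."""
    n = sum(counts)
    mean = sum(c * x for x, c in zip(midpoints, counts)) / n
    raw_var = sum(c * (x - mean) ** 2 for x, c in zip(midpoints, counts)) / n
    return mean, raw_var - h * h / 12.0  # Sheppard's correction

def normal_cdf(x, m, sigma):
    """Distribution function of the normal (m, sigma) distribution."""
    return 0.5 * (1.0 + math.erf((x - m) / (sigma * math.sqrt(2.0))))

def expected_frequencies(midpoints, h, n, m, sigma):
    """Expected class frequencies n * p_i; extreme classes are open-ended."""
    r = len(midpoints)
    expected = []
    for i, xi in enumerate(midpoints):
        lower = 0.0 if i == 0 else normal_cdf(xi - h / 2.0, m, sigma)
        upper = 1.0 if i == r - 1 else normal_cdf(xi + h / 2.0, m, sigma)
        expected.append(n * (upper - lower))
    return expected

def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

With r classes remaining after any pooling, the resulting value would be referred to the $\chi^2$ distribution with r - 3 d. of fr., as stated in the text.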

When the parent distribution is normal, asymptotic expressions for
the means and variances of the sample characteristics $g_1$ and $g_2$ have
been given in (27.7.9), while the corresponding exact expressions are
found in (29.3.7). A further test of the normality of the distribution
is obtained by comparing the values of $g_1$ and $g_2$ calculated from an
actual sample with the corresponding means and variances.

Table 30.4.2.

Distribution of mean temperatures for June and July in Stockholm
1841-1940.


June 

. —._......._ . .. 

.July 

Degrees 

Celsius 

Observed 

Expected 

Decrees 

Celsius 

Observed 

Expected 

一 12.4 

10 

12 89 

一 14 9 

11 

1041 

125 — 129 

12 

7 89 

15.0—16 4 

7 

6 72 

13.0-134 

0 

10.20 

16 5-169 

8 

0 00 

13 5-139 

10 

11.98 

16 0 — 16 4 

13 

10 nr> 

14 0—14 4 

19 

12 62 

165-10.9 

14 

12 1*2 

14 5-14.9 

10 

12 08 

17 0-17.4 

13 

] 2 20 

16.0 一 16.4 

9 

10 46 

17 5—17 9 

6 

11 16 

16.5-— 15.9 

e 

8.19 

180-184 

9 

9 28 

16 0—16.4 

7 

6 81 

186-189 

7 

7 02 

10 5 - 

8 

7.98 

19 0— 

12 

11.14 

Total 

100 

100 00 

Total 

100 

lOO.oo 1 

1 




69.S88485 

10 3 0 


^ltxJP 


1.670062 




d. 

(7 

.86.S5 


28, 

4 

II 




Table 30.4.3. 

Breadth of beans, ^ 6.826 mm, k = 0.26 mm. 



Table 30.4.2 shows the result of fitting normal curves to the
distributions of mean temperatures for the months of June and July in
Stockholm during the n = 100 years 1841-1940. In the original data,
the figures are given to the nearest tenth of a degree, so that the
exact class intervals are (12.45, 12.95) etc. We have here used
somewhat smaller groups than is usually advisable. Both values of $\chi^2$
indicate a satisfactory agreement with the hypothesis of a normal
distribution. The values of $g_1$ and $g_2$ are also given in the table. On
the normal hypothesis, the exact expressions (29.3.7) give in both
cases $E(g_1) = 0$, $D(g_1) = 0.238$, and $E(g_2) = -0.059$, $D(g_2) = 0.455$, so
that none of the observed values differs significantly from its mean.
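
The figures quoted for n = 100 can be checked numerically. The sketch below assumes that (29.3.7) states the standard exact moments of $g_1$ and $g_2$ for a normal sample, namely $D^2(g_1) = 6(n-2)/((n+1)(n+3))$, $E(g_2) = -6/(n+1)$ and $D^2(g_2) = 24n(n-2)(n-3)/((n+1)^2(n+3)(n+5))$; these formulas are supplied here as an assumption, not copied from the text, and the function names are hypothetical.

```python
import math

def g1_moments(n):
    """Exact mean and s.d. of g1 for a normal sample of size n (assumed formulas)."""
    var = 6.0 * (n - 2) / ((n + 1) * (n + 3))
    return 0.0, math.sqrt(var)

def g2_moments(n):
    """Exact mean and s.d. of g2 for a normal sample of size n (assumed formulas)."""
    mean = -6.0 / (n + 1)
    var = 24.0 * n * (n - 2) * (n - 3) / ((n + 1) ** 2 * (n + 3) * (n + 5))
    return mean, math.sqrt(var)

if __name__ == "__main__":
    for n in (100, 12000):
        print(n, g1_moments(n), g2_moments(n))
```

An observed $g_1$ or $g_2$ would then be compared with its mean, using the corresponding s.d. as the yardstick.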



A diagram of the sum polygon for the June distribution (drawn 
from the 100 individual sample values), together with the corresponding 
normal curve, has been given in Fig. 25, p. 328. 

When $g_1$ or $g_2$ have significant values, the fit obtained by a normal
curve may often be considerably improved by using the Charlier or
Edgeworth expansions treated in 17.6-17.7. We must then bear in
mind that, for every additional parameter determined from the sample,
the number of d. of fr. should be reduced by one.

Table 30.4.3 shows the distribution of the breadths of n = 12 000
beans of Phaseolus vulgaris (Johannsen's data, quoted from Charlier,
Ref. 9, p. 73). On the hypothesis of a normal distribution, we have
$E(g_1) = 0$, $D(g_1) = 0.0224$, and $E(g_2) = -0.0005$, $D(g_2) = 0.0447$, so that
the actual values of $g_1$ and $g_2$ given in the table both differ
significantly from the values expected on the normal hypothesis.

The table gives also the expected frequencies and the corresponding
values of $\chi^2$, calculated on the three hypotheses that the fr. f. of the
standardized variable $\frac{x - \bar x}{s}$ is, in accordance with the expansion
(17.7.3) or (17.7.5),¹)

a) »normal»: $\varphi(x)$,

b) »first approx.»: $\varphi(x) - \frac{g_1}{3!}\varphi^{(3)}(x)$,

c) »second approx.»: $\varphi(x) - \frac{g_1}{3!}\varphi^{(3)}(x) + \frac{g_2}{4!}\varphi^{(4)}(x) + \frac{10 g_1^2}{6!}\varphi^{(6)}(x)$.

In the first two cases, the deviations of the sample from the
hypothetical distributions are highly significant, the values of P being
below 0.001, while in the third case we have P = 0.19, so that the
agreement is satisfactory. In Fig. 26, p. 329, we have shown the
histogram of this distribution, compared with the frequency curve for the
»second approx.». More detailed comparisons for this and other
examples are given by Cramér, Ref. 70.

¹) By the same method as above, it is shown that the estimates to be used for
the coefficients $\gamma_1$ and $\gamma_2$ are $g_1$ and $g_2$, as calculated from the grouped sample, using
Sheppard's corrections.

30.5. Contingency tables. Suppose that the n individuals of a
sample are classified according to two variable arguments (quantitative
or not) in a two-way table of the type shown in Table 30.5.1.

A table of this kind is known as a contingency table, and it is
often required to test the hypothesis that the two variable arguments
are independent. Denote by $p_{ij}$ the probability that a randomly chosen
individual belongs to the i:th row and the j:th column of the table.


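Table 30.5.1.

$$\begin{array}{c|cccc|c}
\text{Arguments} & 1 & 2 & \dots & s & \text{Total} \\ \hline
1 & \nu_{11} & \nu_{12} & \dots & \nu_{1s} & \nu_{1\cdot} \\
2 & \nu_{21} & \nu_{22} & \dots & \nu_{2s} & \nu_{2\cdot} \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
r & \nu_{r1} & \nu_{r2} & \dots & \nu_{rs} & \nu_{r\cdot} \\ \hline
\text{Total} & \nu_{\cdot 1} & \nu_{\cdot 2} & \dots & \nu_{\cdot s} & n
\end{array}$$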


The hypothesis of independence is then (cf 21.1.4) equivalent to the
hypothesis that there exist r + s constants $p_{i\cdot}$ and $p_{\cdot j}$ such that

$$p_{ij} = p_{i\cdot}\,p_{\cdot j}, \qquad \sum_i p_{i\cdot} = \sum_j p_{\cdot j} = 1.$$

According to this hypothesis, the joint distribution of the two
arguments contains r + s - 2 unknown parameters, since by means of the
last relations two of the r + s constants, say $p_{r\cdot}$ and $p_{\cdot s}$, may be
expressed in terms of the remaining r + s - 2.

In order to apply the $\chi^2$ test to this problem, we have to calculate

$$\chi^2 = \sum_{i,j} \frac{(\nu_{ij} - n\,p_{i\cdot}\,p_{\cdot j})^2}{n\,p_{i\cdot}\,p_{\cdot j}},$$

where the sum is extended over all rs classes of the contingency
table, and replace here the parameters $p_{i\cdot}$ and $p_{\cdot j}$ by their estimates
derived from the equations (30.3.3) or (30.3.3 a), which in this case
become

$$\frac{\nu_{i\cdot}}{p_{i\cdot}} - \frac{\nu_{r\cdot}}{p_{r\cdot}} = 0 \quad (i = 1, \dots, r-1), \qquad \frac{\nu_{\cdot j}}{p_{\cdot j}} - \frac{\nu_{\cdot s}}{p_{\cdot s}} = 0 \quad (j = 1, \dots, s-1).$$

The solution of these equations is

$$p_{i\cdot} = \frac{\nu_{i\cdot}}{n}, \qquad p_{\cdot j} = \frac{\nu_{\cdot j}}{n},$$

so that the estimates to be used are simply the frequency ratios
calculated from the marginal totals. Substituting these estimates for
$p_{i\cdot}$ and $p_{\cdot j}$, the expression for $\chi^2$ reduces to

$$\chi^2 = n \sum_{i,j} \frac{\left(\nu_{ij} - \frac{\nu_{i\cdot}\nu_{\cdot j}}{n}\right)^2}{\nu_{i\cdot}\,\nu_{\cdot j}} = n\left(\sum_{i,j} \frac{\nu_{ij}^2}{\nu_{i\cdot}\,\nu_{\cdot j}} - 1\right). \tag{30.5.1}$$

Since we have here rs groups and r + s - 2 parameters determined
from the sample, the limiting distribution of $\chi^2$ has
rs - (r + s - 2) - 1 = (r - 1)(s - 1) d. of fr. Exact expressions for the
mean and the variance of $\chi^2$ as defined by (30.5.1) have been given by
various authors (cf Haldane, Ref. 123, where further references are given).
Assuming that the independence hypothesis is true, we have

$$E(\chi^2) = \frac{n}{n-1}\,(r-1)(s-1). \tag{30.5.2}$$

The variance has a complicated expression that will not be given here.

A large value of $\chi^2$ shows that the deviation from the hypothesis
of independence is significant, but gives no direct information about
the degree of dependence or association between the arguments. On
the other hand, the quantity

$$f^2 = \frac{\chi^2}{n} = \sum_{i,j} \frac{\left(\frac{\nu_{ij}}{n} - \frac{\nu_{i\cdot}}{n}\,\frac{\nu_{\cdot j}}{n}\right)^2}{\frac{\nu_{i\cdot}}{n}\,\frac{\nu_{\cdot j}}{n}}$$

is the sample characteristic corresponding to the mean square
contingency $\varphi^2$ defined by (21.9.6). If q is the smallest of the numbers r
and s, it follows from 21.9 that

$$0 \le \frac{f^2}{q-1} = \frac{\chi^2}{n(q-1)} \le 1.$$

The upper limit 1 is attained when and only when each row (when
$r \ge s$) or each column (when $r \le s$) contains one single element different
from zero. Thus $\frac{\chi^2}{n(q-1)}$ may be regarded as a measure of the degree
of association indicated by the sample. The distribution of this measure
is, of course, obtained by a simple change of variable in the
distribution of $\chi^2$. (For other measures of association, cf e.g. the text-book by
Yule-Kendall, Ref. 43, chs 3-4.)
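
As an illustration, the second form of (30.5.1) and the measure of association $\chi^2/(n(q-1))$ can be sketched in Python. The function names are hypothetical, the table is passed as a list of rows, and every marginal total is assumed to be positive.

```python
def contingency_chi_square(table):
    """Return (chi2, d. of fr.) for an r x s table, using (30.5.1)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, v in enumerate(row):
            total += v * v / (row_totals[i] * col_totals[j])
    chi2 = n * (total - 1.0)
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof

def association_measure(table):
    """chi2 / (n (q - 1)), where q is the smaller of r and s."""
    chi2, _ = contingency_chi_square(table)
    n = sum(sum(row) for row in table)
    q = min(len(table), len(table[0]))
    return chi2 / (n * (q - 1))
```

For a table in which every row has a single non-zero entry, such as [[5, 0], [0, 7]], the measure reaches its upper limit 1, in agreement with the statement above.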




At the Swedish census of March 1936, a sample of 25 263 married
couples was taken from the population of all married couples in
country districts, who had been married for at most five years. Table
30.5.2 gives the distribution of the sample according to annual income
and number of children. From (30.5.1) we obtain $\chi^2 = 568.5$ with
(5 - 1)(4 - 1) = 12 d. of fr., so that the deviation from the hypothesis
of independence is highly significant. On the other hand, the measure
of association is $\frac{\chi^2}{n(q-1)} = 0.00750$, thus indicating only a slight degree
of dependence.


Table 30.5.2. 


Distribution of married couples according to annual income and 
number of children. 


Income (unit 1000 kr) 
1-2 I 2 t3 


! ]61 


3 577 


!766 


6 081 




3 046 25 263 


In the particular case when r = s = 2, the contingency table 30.5.1
becomes a 2 x 2 table or a fourfold table, and the expression (30.5.1)
reduces to

$$\chi^2 = n\,\frac{(\nu_{11}\nu_{22} - \nu_{12}\nu_{21})^2}{\nu_{1\cdot}\,\nu_{2\cdot}\,\nu_{\cdot 1}\,\nu_{\cdot 2}}, \tag{30.5.3}$$

so that $f^2 = \chi^2/n$ corresponds to the expression (21.9.7) for $\varphi^2$. When
the arguments are quantitative, $f^2$ is identical with the square of the
correlation coefficient of the sample (cf 21.9.7 and 21.7.3). In the
case of a fourfold table, there is only (2 - 1)(2 - 1) = 1 d. of fr. in
the limiting distribution of $\chi^2$, and we have q - 1 = 1.

In Table 30.5.3, we give the distribution of head hair and eyebrow
colours of 46 542 Swedish conscripts according to Lundborg and
Linders (Ref. 26). From (30.5.3) we obtain $\chi^2 = 19288$ and $f^2 = 0.414$,
indicating a marked dependence between the arguments.
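
The fourfold formula (30.5.3) is easy to check against the conscript data. In the sketch below the function name is hypothetical, and the cell entry 3238 is the value implied by the marginal totals 33 710 and 12 706 of Table 30.5.3.

```python
def fourfold_chi_square(a, b, c, d):
    """chi2 for the 2 x 2 table [[a, b], [c, d]], following (30.5.3)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Rows: eyebrows light/red, dark/medium; columns: head hair light/red, dark/medium.
a, b, c, d = 30472, 3238, 3364, 9468
chi2 = fourfold_chi_square(a, b, c, d)
f2 = chi2 / (a + b + c + d)

if __name__ == "__main__":
    print(round(chi2), round(f2, 3))
```

Run on these figures, the sketch should reproduce the two values quoted in the text to the stated number of digits.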



Table 30.5.3.

Hair colours of Swedish conscripts.

                             Head hair
Eyebrows             Light or red   Dark or medium     Total
Light or red            30 472           3 238         33 710
Dark or medium           3 364           9 468         12 832
Total                   33 836          12 706         46 542


When the expected frequencies $\frac{\nu_{i\cdot}\nu_{\cdot j}}{n}$ in a fourfold table are small,
the approximation obtained by the usual $\chi^2$ tables will be improved
if we calculate $\chi^2$ from the first expression (30.5.1), and reduce the
absolute value of each difference $\nu_{ij} - \frac{\nu_{i\cdot}\nu_{\cdot j}}{n}$ by $\tfrac12$ before squaring.
This is known as Yates' correction (Ref. 250).
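
A minimal sketch of the correction just described, for a fourfold table. The function name is hypothetical, and clamping the reduced difference at zero is an added safeguard that is not part of the text.

```python
def yates_chi_square(a, b, c, d):
    """chi2 for [[a, b], [c, d]] with each |difference| reduced by 1/2."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            # Reduce the absolute difference by 1/2 before squaring
            # (never below zero: an assumption of this sketch).
            diff = max(abs(observed[i][j] - expected) - 0.5, 0.0)
            chi2 += diff ** 2 / expected
    return chi2
```

For the table [[10, 20], [30, 40]] every difference has absolute value 2, so the correction replaces 4 by 2.25 in each of the four terms.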

30.6. $\chi^2$ as a test of homogeneity. The contingency table 30.5.1
expresses the joint result of a sequence of n repetitions of a random
experiment, each individual result being classified according to two
variable arguments. In many cases, however, we encounter tables of
the same formal appearance, where the situation is different.

Suppose that we have made s successive sequences of observations,
consisting of $n_1, \dots, n_s$ observations respectively, where the numbers
$n_j$ are not determined by chance, but are simply to be regarded as
given numbers. At each observation we observe a certain variable
argument, and the results of each sequence are classified according
to this argument in r groups, the number of observations in the i:th
group of the j:th sequence being denoted by $\nu_{ij}$. Our data will then
be expressed by a table which is formally identical with Table 30.5.1,
the column totals $\nu_{\cdot j}$ being here denoted by $n_j$. In the present case,
however, the table does not express the result of one single sequence
of observations, but of s independent sequences, each of which
corresponds to one column of the table.

In such a case, it is often required to test the hypothesis that
the s samples represented by the columns are drawn from the same
population, so that the data are homogeneous in this respect. This is
equivalent to the hypothesis that there are r constants $p_1, \dots, p_r$ with
$\sum_i p_i = 1$, such that the probability of a result belonging to the i:th
group is equal to $p_i$ in all s sequences.

In order to test this hypothesis, we calculate $\chi^2$ from the same
formula (30.5.1) as in the previous case. A slight modification of the
proof of 30.3 then shows that, if the hypothesis is true, $\chi^2$ has the
usual limiting distribution with the same number (r - 1)(s - 1) of d.
of fr. as before.


Unlike the corresponding proposition of the preceding paragraph, this is not a
direct corollary of the general theorem of 30.3, but requires separate proof. The
theorem of 30.3 may, in fact, be generalized to the case when we consider s
independent samples of $n_1, \dots, n_s$ individuals, all with the same r frequency groups, and
determine a certain number, say t, of unknown parameters by applying the modified
$\chi^2$ minimum method to the expression

$$\chi^2 = \sum_{i,j} \frac{(\nu_{ij} - n_j p_i)^2}{n_j p_i}.$$

A straightforward generalization of the proof of 30.3 then shows that $\chi^2$ has the
usual limiting distribution with (r - 1)s - t d. of fr. In the case considered above,
we are concerned with the hypothesis that the s samples are drawn from the same
population, without further specification of the distribution, so that the parameters
are the probabilities $p_i$ themselves. Owing to the relation $\sum_i p_i = 1$ there are
t = r - 1 parameters, and thus (r - 1)(s - 1) d. of fr.


By means of the generalized theorem, we may also apply $\chi^2$ to test the
hypothesis that s given samples are drawn from the same population of a specified type
such as the Poisson, the normal, etc. In such a case, the application of the modified
$\chi^2$ minimum method to the above expression for $\chi^2$ shows that the parameters of
the distribution should be determined in the same way as if we were concerned with
one single sample with group frequencies equal to the row sums $\nu_{i\cdot}$ of the given
table. The proof of this statement will be left as an exercise for the reader.


In the particular case when r = 2, the table may be written:

$$\begin{array}{cccc|c}
\nu_1 & \nu_2 & \dots & \nu_s & \sum_j \nu_j \\
n_1 - \nu_1 & n_2 - \nu_2 & \dots & n_s - \nu_s & n - \sum_j \nu_j \\ \hline
n_1 & n_2 & \dots & n_s & n
\end{array}$$

We are here concerned with s sequences of observations, the number
of occurrences of a certain event E being respectively $\nu_1, \dots, \nu_s$, and
we ask whether it is reasonable to assume that E has a constant,
though unknown probability p throughout the observations. The
estimate to be used for p will here be the frequency ratio of E in
the totality of the data: $p^* = 1 - q^* = \frac{1}{n}\sum_j \nu_j$, and we obtain from
the formula (30.5.1) ¹)

$$\chi^2 = \frac{1}{p^* q^*} \sum_j \frac{(\nu_j - n_j p^*)^2}{n_j} \tag{30.6.1}$$

with s - 1 d. of fr. Writing $Q^2 = \frac{n-1}{n(s-1)}\,\chi^2$, the quantity Q is
identical with the divergence coefficient introduced by Lexis. In
accordance with (30.5.2), we have $E(Q^2) = 1$. (Cf e.g. Tschuprow, Ref.
227 a, Cramér, Ref. 10, p. 105-123.)

Table 30.6.1 gives the number of children born in Sweden during
the s = 12 months of the year 1935. The estimated probability of a
male birth is $p^* = \frac{45682}{88273} = 0.5175082$. From (30.6.1) we find $\chi^2 = 14.986$
with 11 d. of fr., which corresponds to P = 0.18, so that the data
are consistent with the hypothesis of a constant probability.

Table 30.6.1.

Sex distribution of children born in Sweden in 1935.

Month     Boys    Girls    Total
  1       3743    3537     7280
  2       3550    3407     6957
  3       4017    3866     7883
  4       4173    3711     7884
  5       4117    3775     7892
  6       3944    3665     7609
  7       3964    3621     7585
  8       3797    3596     7393
  9       3712    3491     7203
 10       3512    3391     6903
 11       3392    3160     6552
 12       3761    3371     7132
Total    45682   42591    88273
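
The calculation for the birth data can be sketched as follows. The function name is hypothetical; the monthly counts are those of Table 30.6.1, with doubtful digits settled by requiring that boys plus girls equal the monthly totals and that the columns add up to 45 682, 42 591 and 88 273.

```python
boys = [3743, 3550, 4017, 4173, 4117, 3944, 3964, 3797, 3712, 3512, 3392, 3761]
girls = [3537, 3407, 3866, 3711, 3775, 3665, 3621, 3596, 3491, 3391, 3160, 3371]

def constant_probability_chi_square(successes, trials):
    """Return (p*, chi2, d. of fr.) for the test of a constant probability."""
    n = sum(trials)
    p = sum(successes) / n          # frequency ratio in the totality of the data
    q = 1.0 - p
    chi2 = sum((v - nj * p) ** 2 / nj for v, nj in zip(successes, trials)) / (p * q)
    return p, chi2, len(trials) - 1

births = [b + g for b, g in zip(boys, girls)]
p_star, chi2, dof = constant_probability_chi_square(boys, births)

if __name__ == "__main__":
    print(p_star, chi2, dof)
```

The probability P attached to the result would still have to be read from a $\chi^2$ table for 11 d. of fr.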


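We finally consider the case s = 2. In this case we are concerned
with two independent samples, and we want to know whether these
are drawn from the same population. The table may then be written

$$\begin{array}{cccc|c}
\mu_1 & \mu_2 & \dots & \mu_r & m \\
\nu_1 & \nu_2 & \dots & \nu_r & n \\ \hline
\mu_1 + \nu_1 & \mu_2 + \nu_2 & \dots & \mu_r + \nu_r & m + n
\end{array}$$

¹) Cf also (30.2.1).

We have r - 1 d. of fr., and (30.5.1) gives (cf K. Pearson, Ref. 186,
and R. A. Fisher, Ref. 91)

$$\chi^2 = m n \sum_{i=1}^{r} \frac{1}{\mu_i + \nu_i}\left(\frac{\mu_i}{m} - \frac{\nu_i}{n}\right)^2. \tag{30.6.2}$$

Writing $\varpi_i = \frac{\mu_i}{\mu_i + \nu_i}$ and $\varpi = \frac{m}{m + n}$, this reduces to the following
expression due to Snedecor (Ref. 35, p. 173), which is often convenient
for practical computation,

$$\chi^2 = \frac{1}{\varpi(1 - \varpi)}\left(\sum_{i=1}^{r} \mu_i \varpi_i - m \varpi\right). \tag{30.6.3}$$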
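
That the two expressions for the two-sample case agree can be checked numerically. The sketch below takes (30.6.2) as $mn\sum_i (\mu_i/m - \nu_i/n)^2/(\mu_i+\nu_i)$ and Snedecor's form as $(\sum_i \mu_i\varpi_i - m\varpi)/(\varpi(1-\varpi))$ with $\varpi_i = \mu_i/(\mu_i+\nu_i)$ and $\varpi = m/(m+n)$; the function names and the small data set are made up for illustration.

```python
def two_sample_chi_square(mu, nu):
    """chi2 for two grouped samples mu and nu, in the form of (30.6.2)."""
    m, n = sum(mu), sum(nu)
    return m * n * sum((a / m - b / n) ** 2 / (a + b) for a, b in zip(mu, nu))

def snedecor_chi_square(mu, nu):
    """The same quantity in Snedecor's computational form (30.6.3)."""
    m, n = sum(mu), sum(nu)
    w = m / (m + n)
    weighted = sum(a * a / (a + b) for a, b in zip(mu, nu))  # sum of mu_i * w_i
    return (weighted - m * w) / (w * (1.0 - w))

if __name__ == "__main__":
    mu, nu = [10, 20, 30], [20, 20, 20]   # illustrative frequencies only
    print(two_sample_chi_square(mu, nu), snedecor_chi_square(mu, nu))
```

Both functions assume that no group is empty in both samples at once, and the result has r - 1 d. of fr.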

Table 30.6.2 gives some income distributions from the Swedish
census of 1930. When we compare the income distributions of the
age groups 40-50 and 50-60 for all industrial workers and employees,
(30.6.3) gives $\chi^2 = 840.62$ with 5 d. of fr., showing a highly significant
difference between the distributions. It is evident that in this case
the numbers $\varpi_i$ show a tendency to increase with increasing income.
When we pass to the more homogeneous group of the foremen, however,
this tendency disappears, and the comparison of the income
distributions of the two age groups gives here $\chi^2 = 4.27$ and P = 0.51, so that
we may consider these two samples as drawn from the same population.

Table 30.6.2.

Income distributions from Swedish census of 1930.


30.7. Criterion of differential death rates. Suppose that, in a
mortality investigation, we have obtained the following data for two
different classes (districts, occupations etc.) of persons.

$$\begin{array}{c|cc|cc}
 & \text{Class A} & & \text{Class B} & \\
\text{Age group} & \text{Exposed to risk} & \text{Deaths} & \text{Exposed to risk} & \text{Deaths} \\ \hline
1 & n_1 & d_1 & n'_1 & d'_1 \\
2 & n_2 & d_2 & n'_2 & d'_2 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
r & n_r & d_r & n'_r & d'_r
\end{array}$$

It is required to test whether the sequences of death rates $d_i/n_i$ and
$d'_i/n'_i$ obtained from these data are significantly different. For each
age group, we may form a 2 x 2 table of the type

$$\begin{array}{l|cc}
 & \text{Class A} & \text{Class B} \\ \hline
\text{Dead} & d_i & d'_i \\
\text{Surviving} & n_i - d_i & n'_i - d'_i
\end{array}$$

and calculate from (30.6.2) the corresponding quantity

$$\chi_i^2 = \frac{n_i\,n'_i\,(n_i + n'_i)}{(d_i + d'_i)(n_i + n'_i - d_i - d'_i)}\left(\frac{d_i}{n_i} - \frac{d'_i}{n'_i}\right)^2,$$

which has one d. of fr. The successive $\chi_i^2$ are independent. Thus if
we assume that the two populations have identical death rates, the
sum $\chi^2 = \sum_i \chi_i^2$ has the usual limiting distribution with r d. of fr.,
and this provides a test of the hypothesis (cf K. Pearson and Tocher,
Ref. 187; R. A. Fisher, Ref. 91; Wahlund, Ref. 228).

Table 30.7.1 contains some data from a tuberculosis investigation
by G. Berg (Ref. 61). It is required to test whether there are any
significant differences in mortality between the two sexes during the
first year after the finding of T.B. plus. The total $\chi^2$ amounts to
22.2 with 10 d. of fr., which corresponds to P = 0.014, so that the
deviation is »almost significant» according to our conventional
terminology (cf 30.2). From the values of $\chi_i^2$ given in the last column of
the table, it is seen that the main contributions to $\chi^2$ arise from
the ages 30-50, where the women show a considerably higher
mortality than the men.

Table 30.7.1.

Death rates for patients suffering from open pulmonary tuberculosis,
during first year after finding T.B. plus.

                      Men                         Women
Age group   Exposed  Deaths  Death      Exposed  Deaths  Death     $\chi_i^2$
            to risk          rate %     to risk          rate %
            $n_i$    $d_i$              $n'_i$   $d'_i$
15-19         406     156    38.4         500     174    34.8       1.25
20-24         695     204    29.4         816     246    30.1       0.11
25-29         585     169    28.9         619     184    29.7       0.09
30-34         454     128    28.2         433     150    34.6       4.22
35-39         274      82    29.9         257      92    35.8       2.10
40-44         221      68    30.8         194      83    42.8       6.43
45-49         153      41    26.8          94      39    41.5       5.75
50-54         110      34    30.9          58      20    34.5       0.23
55-59          69      36    52.2          29      13    44.8       0.45
60-            89      43    48.3          47      28    59.6       1.57
Total       3 056     961               3 047   1 029              22.20
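
The criterion can be sketched in Python with the figures of Table 30.7.1. The function name is hypothetical, and the exposed-to-risk and death counts are those whose column sums reproduce the printed totals 3 056, 961, 3 047 and 1 029. The terms are computed without intermediate rounding, so individual values may differ from the printed column in the second decimal.

```python
men_exposed = [406, 695, 585, 454, 274, 221, 153, 110, 69, 89]
men_deaths = [156, 204, 169, 128, 82, 68, 41, 34, 36, 43]
women_exposed = [500, 816, 619, 433, 257, 194, 94, 58, 29, 47]
women_deaths = [174, 246, 184, 150, 92, 83, 39, 20, 13, 28]

def death_rate_chi_square(n_a, d_a, n_b, d_b):
    """chi2 with one d. of fr. for one age group (2 x 2 table dead/surviving)."""
    factor = n_a * n_b * (n_a + n_b) / ((d_a + d_b) * (n_a + n_b - d_a - d_b))
    return factor * (d_a / n_a - d_b / n_b) ** 2

contributions = [
    death_rate_chi_square(na, da, nb, db)
    for na, da, nb, db in zip(men_exposed, men_deaths, women_exposed, women_deaths)
]
total_chi2 = sum(contributions)   # referred to chi2 with r = 10 d. of fr.

if __name__ == "__main__":
    print([round(c, 2) for c in contributions], round(total_chi2, 2))
```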


30.8. Further tests of goodness of fit. As already observed in
30.1, it is always advisable to try to supplement the $\chi^2$ test by other
methods. In many cases, a simple inspection of the signs and
magnitudes of the differences between observed and expected frequencies
will reveal systematic deviations from the hypothesis tested, even
though $\chi^2$ may have a non-significant value.

When the $\chi^2$ test is applied to a comparatively small sample, it is
necessary to use a grouping with large class intervals, and thus
sacrifice a good deal of the information conveyed by the sample. In such
cases, it would be desirable to have recourse to a test based on the
individual sample values. We shall now briefly mention a test of
this type.

Let it be required to test the hypothesis that a sample of n
observed values $x_1, \dots, x_n$ has been drawn from a population with the
given d. f. $F(x)$. The d. f. of the sample (cf 25.3) is $F^*(x) = \nu/n$, where
$\nu$ is the number of sample values $\le x$. Since $F^*$ converges in
probability to $F$ (cf 25.5) for any fixed x, we may consider the integral

$$\int_{-\infty}^{\infty} [F^*(x) - F(x)]^2\,dK(x),$$

where $K(x)$ may be more or less arbitrarily chosen, as a measure of
the deviation of our sample from the hypothesis. Tests based on
measures of this type were first introduced by Cramér (Ref. 10 and
70) and von Mises (Ref. 27). Following Smirnoff (Ref. 215), we shall
here take $K(x) = F(x)$, and thus obtain the integral

$$\omega^2 = \int_{-\infty}^{\infty} [F^*(x) - F(x)]^2\,dF(x).$$

If the sample values $x_1, \dots, x_n$ are arranged in increasing order, we
have for any continuous $F(x)$

$$\omega^2 = \frac{1}{12 n^2} + \frac{1}{n}\sum_{\nu=1}^{n}\left(F(x_\nu) - \frac{2\nu - 1}{2n}\right)^2.$$

When the individual sample values are known, the exact value of $\omega^2$
may thus be simply calculated. When only a grouped sample is
available, an approximate value can be found, e.g. by the usual assumption
that the $x_\nu$ are situated in the mid-points of the class intervals.

As observed in 25.5, $F^*(x)$ is the frequency ratio in n trials of an
event of probability $F(x)$. Hence $E(F^* - F)^2 = \frac{F(1 - F)}{n}$. By means
of this remark, it is possible to find the mean and the variance of $\omega^2$.
These are independent of $F(x)$, and we have

$$E(\omega^2) = \frac{1}{6n}, \qquad D^2(\omega^2) = \frac{4n - 3}{180\,n^3}.$$

Comparing the value of $\omega^2$ found in an actual sample with the mean
and the variance calculated from these expressions, we obtain a test
of our hypothesis. The sampling distribution of $\omega^2$, which is
independent of $F(x)$, has been further investigated by Smirnoff (Ref. 215),
who has shown that $n\omega^2$ has, as $n \to \infty$, a certain non-normal limiting
distribution independent of n (cf the case of $n r^2$ in 29.13). It would
be desirable to extend the theory to cases when the hypothetical
$F(x)$ is not completely specified, but contains certain parameters that
must be estimated from the sample.

Further important tests of goodness of fit have been proposed e.g.
by Neyman (Ref. 164) and E. S. Pearson (Ref. 191).
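
The $\omega^2$ test described in this paragraph can be sketched as follows, taking $\omega^2 = \frac{1}{12n^2} + \frac{1}{n}\sum_\nu \left(F(x_\nu) - \frac{2\nu-1}{2n}\right)^2$ for the ordered sample, with mean $1/(6n)$ and variance $(4n-3)/(180n^3)$. The function names are hypothetical, and the hypothetical d. f. is passed in as an ordinary Python function that is assumed to be continuous and completely specified.

```python
import math

def omega_square(sample, cdf):
    """omega^2 for a sample and a completely specified continuous d. f."""
    xs = sorted(sample)
    n = len(xs)
    total = sum(
        (cdf(x) - (2 * k - 1) / (2 * n)) ** 2 for k, x in enumerate(xs, start=1)
    )
    return 1.0 / (12 * n * n) + total / n

def omega_square_moments(n):
    """Mean and s.d. of omega^2 under the hypothesis."""
    return 1.0 / (6 * n), math.sqrt((4 * n - 3) / (180.0 * n ** 3))

if __name__ == "__main__":
    # Rectangular distribution on (0, 1): F(x) = x.
    sample = [0.875, 0.125, 0.625, 0.375]
    print(omega_square(sample, lambda x: x), omega_square_moments(len(sample)))
```

In the small example every sample value sits exactly at $(2\nu-1)/(2n)$, so the sum vanishes and $\omega^2$ takes its least possible value $1/(12n^2)$.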


CHAPTER 31.

Tests of Significance for Parameters. 

31.1. Tests based on standard errors. In the applications, it
is often required to use a set of sample values for testing the
hypothesis that a certain parameter of the corresponding population, such
as a mean, a correlation coefficient, etc., has some value given in
advance. In other cases, several independent samples are available, and
we want to test whether the differences between the observed values
of a certain sample characteristic are significant, i.e. indicative of a
real difference between the corresponding population parameters.

Now we have seen in Ch. 28 that important classes of sample
characteristics are, in large samples, asymptotically normal with means
and variances determined by certain population parameters. Hence we
may deduce tests of significance for hypotheses of the above type,
following the general procedure indicated in 26.2 (cf also 35.1).

Thus if we draw a sample of n values $x_1, \dots, x_n$ from any
population (not necessarily normal) with the mean $m$ and the s. d. $\sigma$, we know
by 17.4 and 28.2 that the mean $\bar x$ of the sample values is
asymptotically normal $(m, \sigma/\sqrt{n})$. Suppose for one moment that we know $\sigma$, and
that we are testing the hypothesis that $m$ has a specified value $m_0$.
If the hypothesis is true, $\bar x$ is asymptotically normal $(m_0, \sigma/\sqrt{n})$.
Denoting by $\lambda_p$ the p % value of a normal deviate (cf 17.2), we thus
have for large n a probability of approximately p % to encounter a
deviation $|\bar x - m_0|$ exceeding $\lambda_p \sigma/\sqrt{n}$. Working on a p % level, we
should thus reject the hypothesis if $|\bar x - m_0|$ exceeds this limit, whereas
a smaller deviation should be regarded as consistent with the
hypothesis.

Now in practice we usually do not know $\sigma$. By 27.3 we know,
however, that the s. d. $s$ of the sample converges in probability to $\sigma$
as $n \to \infty$. Hence for large n there will only be a small probability
that $s$ differs from $\sigma$ by more than a small amount. For the purposes
of our test, we may thus simply replace $\sigma$ by $s$, and act as if we had
to test the hypothesis that $\bar x$ were normal $(m_0, s/\sqrt{n})$, where $s$ is the
known value calculated from our sample. An observed deviation
$|\bar x - m_0|$ exceeding $\lambda_p s/\sqrt{n}$ will then lead us to reject the hypothesis
$m = m_0$ on a p % level, while a smaller deviation will be regarded as
consistent with the hypothesis.
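
The large-sample test of a mean just described can be sketched as follows. The function name is hypothetical; the p % value of a normal deviate is taken two-sided, so that it is exceeded in absolute value with probability p %, and it is obtained here from Python's `statistics.NormalDist` rather than from a printed table.

```python
import math
from statistics import NormalDist

def large_sample_mean_test(sample, m0, p_percent=5.0):
    """Return (reject, mean, limit) for the hypothesis m = m0 on the p % level."""
    n = len(sample)
    mean = sum(sample) / n
    # s is the s. d. of the sample, with divisor n.
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)
    lam = NormalDist().inv_cdf(1.0 - p_percent / 200.0)  # two-sided p % value
    limit = lam * s / math.sqrt(n)
    return abs(mean - m0) > limit, mean, limit

if __name__ == "__main__":
    data = [0.0, 2.0] * 50          # illustrative sample: mean 1, s = 1, n = 100
    print(large_sample_mean_test(data, 1.1))
    print(large_sample_mean_test(data, 1.3))
```

Being a large-sample method, the sketch is only meant for values of n of the size discussed later in this paragraph.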

The same method may be applied in more general cases. Consider
any sample characteristic $z$, the distribution of which in large samples
is asymptotically normal. In the expression for the variance of the
asymptotic normal distribution of $z$, we replace any unknown
population parameter by the corresponding known sample characteristic,
retaining only the leading term of the expression for large n. The
expression $d(z)$ thus obtained will be denoted as the standard error
of $z$ in large samples. If it is required to test the hypothesis that
the mean $E(z)$ has some specified value $z_0$, we regard $z$ as normally
distributed with the known s. d. $d(z)$. If the deviation $|z - z_0|$
exceeds $\lambda_p\,d(z)$, the hypothesis will then be rejected on the p % level, and
otherwise accepted.

In this way, all expressions deduced in Chs 27-28 for the s. d:s
of sample characteristics and of their asymptotic normal distributions
may be transformed to standard errors. Thus e.g. by (27.2.1), (27.4.2) and
(27.7.2) the standard errors of the sample mean $\bar x$, the sample variance
$s^2 = m_2$ and the sample s. d. $s = \sqrt{m_2}$ are

$$d(\bar x) = \frac{s}{\sqrt{n}}, \qquad d(s^2) = \sqrt{\frac{m_4 - m_2^2}{n}}, \qquad d(s) = \sqrt{\frac{m_4 - m_2^2}{4 m_2 n}}.$$

If it is assumed that the population is normal, the simpler expressions
corresponding to this case may be applied. Thus e.g. by 28.5 the
standard error of the median of a normal sample is

$$s\sqrt{\pi/(2n)} = 1.2533\,s/\sqrt{n}.$$
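
A short sketch of these standard errors, computed from the central moments $m_2$ and $m_4$ of the sample (divisor n). The function name and the returned keys are hypothetical; the expressions used are the leading terms $s/\sqrt{n}$, $\sqrt{(m_4 - m_2^2)/n}$ and $\sqrt{(m_4 - m_2^2)/(4 m_2 n)}$, and the entry for the median applies only if the population is assumed normal.

```python
import math

def standard_errors(sample):
    """Large-sample standard errors of the mean, variance, s. d. and median."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((x - mean) ** 2 for x in sample) / n   # sample variance s^2
    m4 = sum((x - mean) ** 4 for x in sample) / n   # fourth central moment
    s = math.sqrt(m2)
    return {
        "mean": s / math.sqrt(n),
        "variance": math.sqrt((m4 - m2 ** 2) / n),
        "sd": math.sqrt((m4 - m2 ** 2) / (4.0 * m2 * n)),
        "median_normal": s * math.sqrt(math.pi / (2.0 * n)),  # normal case only
    }

if __name__ == "__main__":
    print(standard_errors([1.0, 2.0, 3.0, 4.0, 5.0]))
```

The tiny sample in the usage line only illustrates the arithmetic; the expressions are meant for large n.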

When a sample characteristic $z$ has been computed, it is customary
in practice to indicate its degree of reliability by writing the value $z$
followed by $\pm\,d(z)$. Thus e.g. the sample mean is written $\bar x \pm s/\sqrt{n}$,
etc. For the frequency ratio $\nu/n$ in n trials of an event of constant
probability $p$, we have by (16.2.2) $E(\nu/n) = p$ and $D(\nu/n) = \sqrt{pq/n}$,
so that the standard error is $\sqrt{\frac{\nu(n - \nu)}{n^3}}$, and consequently the
frequency ratio will be written $\frac{\nu}{n} \pm \sqrt{\frac{\nu(n - \nu)}{n^3}}$. The corresponding
percentage $\varpi = 100\,\frac{\nu}{n}$ is accordingly written $\varpi \pm \sqrt{\frac{\varpi(100 - \varpi)}{n}}$.
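
For the frequency ratio and the corresponding percentage, the standard errors $\sqrt{\nu(n-\nu)/n^3}$ and $\sqrt{\varpi(100-\varpi)/n}$ can be sketched as follows. The function names are hypothetical.

```python
import math

def frequency_ratio_with_error(nu, n):
    """Return (nu/n, standard error) for nu occurrences in n trials."""
    return nu / n, math.sqrt(nu * (n - nu) / n ** 3)

def percentage_with_error(nu, n):
    """Return (percentage, standard error of the percentage)."""
    w = 100.0 * nu / n
    return w, math.sqrt(w * (100.0 - w) / n)

if __name__ == "__main__":
    print(frequency_ratio_with_error(30, 100))
    print(percentage_with_error(30, 100))
```

The second standard error is simply 100 times the first, as the two formulas require.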


If two independent samples are given, the difference between their
means or any other characteristics may be tested with the aid of
the standard errors. If the means $\bar x$ and $\bar y$ are regarded as normal
$(m_1, s_1/\sqrt{n_1})$ and $(m_2, s_2/\sqrt{n_2})$ respectively, the difference $\bar x - \bar y$ will be
normal $\left(m_1 - m_2,\ \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\right)$, and any hypothesis concerning the
value of the difference $m_1 - m_2$ can now be tested in the way shown
above. In particular, the hypothesis $m_1 = m_2$ will be rejected on the
p % level, if $|\bar x - \bar y| > \lambda_p \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$, and otherwise accepted.
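
A sketch of the large-sample comparison of two means, using the standard error $\sqrt{s_1^2/n_1 + s_2^2/n_2}$ of the difference. The function name is hypothetical, the sample variances are taken with divisor n, and the two-sided p % value of a normal deviate comes from `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def difference_of_means_test(x, y, p_percent=5.0):
    """Return (reject, difference, standard error) for the hypothesis m1 = m2."""
    n1, n2 = len(x), len(y)
    mx, my = sum(x) / n1, sum(y) / n2
    s1_sq = sum((v - mx) ** 2 for v in x) / n1
    s2_sq = sum((v - my) ** 2 for v in y) / n2
    se = math.sqrt(s1_sq / n1 + s2_sq / n2)
    lam = NormalDist().inv_cdf(1.0 - p_percent / 200.0)
    return abs(mx - my) > lam * se, mx - my, se

if __name__ == "__main__":
    x = [0.0, 2.0] * 50     # illustrative samples only
    y = [1.0, 3.0] * 50
    print(difference_of_means_test(x, y))
```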


All the above methods are valid subject to the condition that
our samples are »large». There are two kinds of approximations
involved, as we have supposed a) that the sampling distributions of our
characteristics are normal, and b) that certain population
characteristics may be replaced by the corresponding values calculated from the
sample. In practice, it is often difficult to know whether our samples
are so large that these approximations are valid. However, some
practical rules may be given. When we are dealing with means, the
approximation is usually good already for n > 30. For variances,
medians, coefficients of skewness and excess, correlation coefficients in
the neighbourhood of $\varrho = 0$, etc., it is advisable to require that n
should be at least about 100. For correlation coefficients considerably
different from zero, even samples of 300 do not always give a
satisfactory approximation.

Even in cases where n is smaller than required by these rules, or
where the sampling distribution does not tend to normality, it is often
possible to draw some information from the standard errors, though
great caution is always to be recommended. When the sampling
distribution deviates considerably from the normal, the tables of the
normal distribution do not give a satisfactory approximation to the
probability of a deviation exceeding a given amount. We can then
always use the inequality (15.7.2), which for any distribution gives
the upper limit $1/k^2$ for the probability of a deviation from the mean
exceeding k times the s. d. However, in most cases occurring in
practice this limit is unnecessarily large. It follows, e.g., from (15.7.4)
that for all unimodal and moderately skew distributions the limit may
be substantially lowered. The same thing follows from the inequality
given in Ex. 6, p. 256, if we assume that the coefficient $\gamma_2$ of the
distribution is of moderate size. When there are reasons to assume
that the sampling distribution belongs to one of these classes, a
deviation exceeding four times the s. d. may as a rule be regarded as
clearly significant. When n is not large enough, it is advisable to
use the complete expressions of the s. d:s, if these are available, and
not only the leading terms. Further, we should then use the unbiased
estimates (cf 27.6) of the population values, thus writing e.g. $s/\sqrt{n-1}$
instead of $s/\sqrt{n}$ for the standard error of the mean. Whenever
possible it is, however, preferable to use in such cases the tests based
on exact distributions that will be treated in the next paragraph.


31.2. Tests based on exact distributions. When the exact sampling
distributions of the relevant characteristics are known, the
approximate methods of the preceding paragraph may be replaced by exact
methods. As observed in 29.1, this situation arises chiefly in cases
where we are sampling from normal populations.

Suppose, e.g., that we are given a sample of n from a normal
population, with unknown parameters $m$ and $\sigma$, and that it is required
to test the hypothesis that $m$ has some value given in advance. If
this hypothesis is true, the sample mean $\bar x$ is exactly normal $(m, \sigma/\sqrt{n})$,
and the standardized variable $\sqrt{n}\,\frac{\bar x - m}{\sigma}$ is normal (0, 1). The
approximate method of the preceding paragraph consists in replacing the
unknown $\sigma$ by an estimate calculated from the sample, for
small n preferably $\sqrt{\frac{n}{n-1}}\,s$, and regarding the expression thus
obtained, $t = \sqrt{n-1}\,\frac{\bar x - m}{s}$, as normal (0, 1). Now $t$ is identical
with the Student ratio of 29.4, and we have seen that the exact
distribution of $t$ is Student's distribution with n - 1 d. of fr. If $t_p$ denotes
the p % value (cf 18.2) of $t$ for n - 1 d. of fr., the probability of a
deviation such that $|t| > t_p$ is thus exactly equal to p %. The
hypothetical value $m$ will thus have to be rejected on a p % level if
$|t| > t_p$, and otherwise accepted.

As $n \to \infty$, the t-distribution approaches the normal form (cf 20.2),
and the figures for this limiting case are given in the last row of
Table 4. It is seen from the table that the normal distribution gives
a fairly good approximation to the t-distribution when $n \ge 30$. For
small n, however, the probability of a large deviation from the mean
is substantially greater in the t-distribution (cf Fig. 20, p. 240).
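
The exact test of a hypothetical mean can be sketched as below, with the Student ratio taken as $t = \sqrt{n-1}\,(\bar x - m)/s$, where $s$ is the s. d. of the sample with divisor n. The function names are hypothetical. Since the Python standard library has no table of Student's distribution, the p % value $t_p$ for n - 1 d. of fr. is assumed to be looked up separately and passed in.

```python
import math

def student_ratio(sample, m0):
    """t = sqrt(n - 1) (mean - m0) / s, with s computed with divisor n."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)
    return math.sqrt(n - 1) * (mean - m0) / s

def reject_mean(sample, m0, t_p):
    """Reject the hypothetical value m0 when |t| exceeds the tabulated t_p."""
    return abs(student_ratio(sample, m0)) > t_p

if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0]   # illustrative sample, n - 1 = 4 d. of fr.
    print(student_ratio(data, 2.0))
```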

When we wish to test whether the means $\bar x$ and $\bar y$ of two
independent normal samples are significantly different, we may set up the
»null hypothesis» that the two samples are drawn from the same
normal population. It has been shown in 29.4 that, if this hypothesis is
true, the variable

$$u = \sqrt{\frac{n_1 n_2 (n_1 + n_2 - 2)}{n_1 + n_2}}\;\frac{\bar x - \bar y}{\sqrt{n_1 s_1^2 + n_2 s_2^2}} \tag{31.2.1}$$

has the t-distribution with $n_1 + n_2 - 2$ d. of fr. When the means and
variances of the samples are given, $u$ can be directly calculated. If
$|u|$ exceeds the p % value of $t$ for $n_1 + n_2 - 2$ d. of fr., our data
show a significant deviation from the null hypothesis on the p % level.
If we have reason to assume that the populations are in fact normal,
and that the s. d:s $\sigma_1$ and $\sigma_2$ are equal, the rejection of the null
hypothesis implies that the means $m_1$ and $m_2$ are different (cf 35.5).

It is evident that we may proceed in the same way in respect of
any function z of sample values, as soon as the exact distribution of
z is known. We set up a probability hypothesis, according to which
an observed value of z would with great probability lie in the
neighbourhood of some known quantity $z_0$. If the hypothesis H is true,
z has a certain known distribution, and from this distribution we may
find the p % value of the deviation $|z - z_0|$, i. e. a quantity $h_p$
such that the probability of a deviation $|z - z_0| > h_p$ is exactly p %.
Working on a p % level, and always following the procedure of 26.2,
we should then reject the hypothesis H if in an actual sample we
find a deviation $|z - z_0|$ exceeding $h_p$, while a smaller deviation
should be regarded as consistent with the hypothesis (cf 35.1).

When we are concerned with samples drawn from normal populations,
tests of significance for various parameters may thus be founded
on the exact sampling distributions deduced in Ch. 29. In practice,
it is very often legitimate to assume that the variables encountered
in different branches of statistical work are at least approximately
normal (cf 17.8). In such cases, the tests deduced for the exactly
normal case will usually give a reasonable approximation. It has, in
fact, been shown that the sampling distributions of various important
characteristics are not seriously affected even by considerable
deviations from normality in the population. In this respect, the reader
may be referred to some experimental investigations by E. S. Pearson
(Ref. 190), and to the dissertation of Quensel (Ref. 200) on certain
sampling distributions connected with a population of Charlier's type
A. It seems desirable that investigations of these types should be
further extended.


31.3. Examples. We now proceed to show some applications
of tests of the types discussed in the two preceding paragraphs. We 
shall first consider some cases where the samples are so large that it 
is perfectly legitimate to use the tests based on standard errors, and 
then proceed to various cases of samples of small or moderate size. 
With respect to the significance of the deviations etc. appearing in 
the examples, we shall use the conventional terminology introduced 
in 30.2. 

Ex. 1. In Table 31.3.1 we give the distribution according to sex
and ages of parents of 928 570 children born in Norway during the
years 1871–1900. (From Wicksell, Ref. 231.) It is required to use
these data to investigate the influence, if any, of the ages of the
parents on the sex ratio of the offspring.

As a first approach to the problem, we calculate from the table 
the percentage of male births, and the corresponding standard error, 
for four large age groups, as shown by Table 31.3.2. 

There are no significant differences between the numbers in this
table. The largest difference occurs between the numbers 51.589
and 51.111, and this difference is 0.478 ± 0.222. The observed difference
is here 2.15 times its standard error, and according to our conventional
terminology this is only »almost significant». Nevertheless, the table
might suggest a conjecture that the excess of boys would tend to
increase when the age difference x − y decreases.
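The arithmetic behind this comparison can be sketched as follows, assuming the two percentages are independent and taking the figures 51.589 ± 0.123 and 51.111 ± 0.185 as read from Table 31.3.2 (the function name is illustrative).

```python
import math

def difference_with_se(a, se_a, b, se_b):
    """Difference of two independent estimates, its standard error, and their ratio."""
    diff = a - b
    se = math.sqrt(se_a ** 2 + se_b ** 2)
    return diff, se, diff / se

# Largest and smallest percentage of male births, as read from Table 31.3.2.
diff, se, ratio = difference_with_se(51.589, 0.123, 51.111, 0.185)
```

The ratio falls between two and three standard errors, which is the range the text describes as »almost significant».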

In order to investigate the question more thoroughly, we consider
the ages x and y of the parents of a child as an observed value of
a two-dimensional random variable. Table 31.3.1 then gives the joint
distributions of x and y for two samples of $n_1$ = 477 533 and
$n_2$ = 451 037 values, for the boys and the girls respectively. If the


Table 31.3.1.

Live born children in Norway 1871–1900.

[Table body not recoverable from the scan. Surviving fragments, with row
and column positions lost: class labels 26–30, 30–36, 85–40, 40–46;
entries 656, 187, 11 173, 3 448, 43 082, 16 760, 38 605, 41 208, 17 014,
32 240, 0 686, 10 214, 93, 25, 36 147, 94 272, 112 670.]



sex ratio among the newborn varies with the ages of the parents, the
(x, y)-distribution must be different for the boys and the girls, so that
the two samples are not drawn from the same population.




Table 31.3.2.

Percentage of male births.

Age of father x      Age of mother y
                     < 30               > 30
< 35                 51.409 ± 0.090     51.589 ± 0.123
> 35                 51.111 ± 0.185     51.430 ± 0.081


Table 31.3.3.

Sample moments for Table 31.3.1, in units of the class breadth (5 years).

Central      Boys                     Girls
moments      Raw        Corrected     Raw         Corrected
$m_{20}$     2.9187     2.8294        2.9086      2.8208
$m_{11}$     1.4140     1.4140        1.4086      1.4086
$m_{02}$     1.7966     1.7128        1.70W       1.7096
$m_{30}$     3.0009     3.0000        3.0891      3.0891
$m_{03}$     0.4688     0.4588        0.4588      0.4588
$m_{40}$     28.M70     27.2807       28.4 & 85   27.0800
$m_{31}$     10.8627    9.9992        10.2509     0.8988
$m_{22}$     7.7286     7.8481        7.6970      7.8128
$m_{13}$     6.8110     6.4575        6.8020      5.4490
$m_{04}$     7.5260     6.6564        7.5260      6.M87


Table 31.3.3 shows the uncorrected moments of the two samples,
and the corrected moments calculated according to (27.9.4) and (27.9.6).
We first observe that the distributions deviate significantly from
normality. Consider, e. g., the marginal distribution of the father's age
x for the boys. On the hypothesis that this distribution is normal,
we find from the corrected moments $g_1$ = 0.6460 ± 0.0035 and
$g_2$ = 0.4016 ± 0.0071, where the standard errors are calculated from
(27.7.9). In both cases, the deviation from zero is highly significant,
so that the hypothesis of normality is clearly disproved. 1)

1) According to Wicksell, l. c., the distribution is approximately
logarithmico-normal (cf 17.6).
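The two standard errors quoted for the skewness and excess coefficients can be reproduced with a short sketch. It is assumed here that the relevant expressions are the usual normal-theory variances of $g_1$ and $g_2$; whether these are exactly the expressions numbered (27.7.9) is not checked against the source, and the function names are illustrative.

```python
import math

def se_g1_normal(n):
    """Normal-theory standard error of the sample skewness g1 (assumed formula)."""
    return math.sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))

def se_g2_normal(n):
    """Normal-theory standard error of the sample excess g2 (assumed formula)."""
    return math.sqrt(24 * n * (n - 2) * (n - 3) / ((n + 1) ** 2 * (n + 3) * (n + 5)))

n_boys = 477533  # size of the boys' sample in Ex. 1
se1 = se_g1_normal(n_boys)
se2 = se_g2_normal(n_boys)
```

For a sample of this size the results are practically $\sqrt{6/n}$ and $\sqrt{24/n}$, so that both observed coefficients lie many standard errors away from zero.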




Table 31.3.4.

Sample characteristics for Table 31.3.1. Unit: one year.

Characteristics        Boys                 Girls                $10^3 \cdot$ Diff.
$\bar{x}$              35.699 ± 0.0122      35.703 ± 0.0125      + 4 ± 17.5
$\bar{y}$              32.128 ± 0.0095      32.116 ± 0.0097      − 12 ± 13.6
$\bar{x} - \bar{y}$    3.571 ± 0.0095       3.587 ± 0.0097       + 16 ± 13.6
$s_1$                  8.410 ± 0.0094       8.397 ± 0.0097       − 13 ± 13.5
$s_2$                  6.543 ± 0.0053       6.538 ± 0.0055       − 5 ± 7.6
$r$                    0.6424 ± 0.00097     0.6414 ± 0.00101     − 1.0 ± 1.40


Table 31.3.4 gives the values of some important sample characteristics
for the boys and the girls, as well as the differences between
corresponding characteristics for both sexes. The standard errors have
been calculated according to the rules of 31.1 from the general
formulae (27.2.1), (27.7.2) and (27.8.1); thus the simpler expressions
(27.8.2) and (29.3.3) corresponding to the case of a normal population
have not been applied here. For the difference $\bar{x} - \bar{y}$, we find

$D^2(\bar{x} - \bar{y}) = (\sigma_1^2 - 2 \rho \sigma_1 \sigma_2 + \sigma_2^2)/n$,

and consequently the square of the standard error is

$d^2(\bar{x} - \bar{y}) = (s_1^2 - 2 r s_1 s_2 + s_2^2)/n$.
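As a check on the expression for the standard error of $\bar{x} - \bar{y}$, the following sketch evaluates it for the boys' sample, with $s_1$, $s_2$ and r as read from Table 31.3.4 (the function name is illustrative).

```python
import math

def se_mean_difference(s1, s2, r, n):
    """Standard error of x-bar minus y-bar for n pairs with correlation r."""
    return math.sqrt((s1 * s1 - 2 * r * s1 * s2 + s2 * s2) / n)

# Boys' sample of Ex. 1; the characteristics are read from Table 31.3.4.
se_boys = se_mean_difference(8.410, 6.543, 0.6424, 477533)
```

The result agrees, to four decimals, with the standard error 0.0095 attached to $\bar{x} - \bar{y}$ for the boys.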

It is seen from the table that there are no significant differences 
between the characteristics. In particular we find that the mean of 
the age difference x — y is not significantly greater for the girls than 
for the boys, so that the conjecture suggested by Table 31.3.2 is not 
supported by further analysis. 

Finally, we may directly apply the $\chi^2$ method to test whether the
two samples in Table 31.3.1 may be regarded as drawn from the same
population. In each of the two samples we have, in fact, 12 · 7 = 84
frequency groups, so that the whole table 31.3.1 may be rearranged
as an 84 × 2 table of the type considered in 30.6, which may be tested
for homogeneity by the $\chi^2$ method, using (30.6.2) or (30.6.3) for the
calculation of $\chi^2$. Pooling all groups with fathers above 60, and with
mothers above 40, we have a 60 × 2 table, and find $\chi^2$ = 51.97 with
(60 − 1)(2 − 1) = 59 d. of fr. According to Fisher's approximation
(cf 20.2), $\sqrt{2\chi^2}$ = 10.20 would then be an observed value of a
normal variable with the mean $\sqrt{117}$ = 10.82 and unit s. d. By
Table 1, the probability of obtaining a value of $\chi^2$ at least as large
as that actually observed is then approximately
1 − Φ(10.20 − 10.82) = 0.73, so that the agreement is very good, and the
data are consistent with the hypothesis that the samples are drawn from
the same population.
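Fisher's approximation as used here can be sketched in a few lines; the function name is illustrative, and `statistics.NormalDist` stands in for Table 1.

```python
import math
from statistics import NormalDist

def fisher_chi2_upper_tail(chi2, dof):
    """Approximate P(chi-square >= chi2), treating sqrt(2*chi2) as normal
    with mean sqrt(2*dof - 1) and unit standard deviation."""
    deviate = math.sqrt(2 * chi2) - math.sqrt(2 * dof - 1)
    return 1 - NormalDist().cdf(deviate)

p_homogeneity = fisher_chi2_upper_tail(51.97, 59)
```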

The analysis of the data in Table 31.3.1 has thus entirely failed
to detect any significant influence of the ages of the parents on the
sex of the children.

Ex. 2. In a racially homogeneous human population, the distributions
of various body measurements usually agree well with the normal
curve, and the small deviations are well represented by the first
terms of a Charlier or Edgeworth series, as given e. g. by (17.7.5).
We refer in this connection to a paper by Cramér (Ref. 70), where
detailed examples are given.

In such cases, the standard errors of sample characteristics may
be calculated from the simplified expressions which hold for the
case of a normal parent distribution. Thus by (29.3.3) the standard
error of s may be put equal to $s/\sqrt{2n}$, the standard error of the
coefficient of variation V may be calculated from (27.7.11), etc.

For the stature of Swedish conscripts, measured in the years
1915–16 and 1924–25 at an average age of 19 years 8 months,
we find according to Hultkrantz (Ref. 128) the sample characteristics
given in Table 31.3.5. The table shows a highly significant increase
of the mean and the median during the interval of 9 years between
the measurements. On the other hand, the s. d. and the coefficient
of variation show a highly significant decrease, while the decrease
of the semi-interquartile range is not significant.

Table 31.3.5.

Sample characteristics for the stature of Swedish conscripts.

Characteristics                            1915–16            1924–25            $10^2 \cdot$ Diff.
n                                          80 084             89 337
Mean $\bar{x}$ (cm)                        171.80 ± 0.022     172.58 ± 0.020     + 78 ± 3.0
Median (cm)                                171.81 ± 0.027     172.55 ± 0.025     + 74 ± 3.7
S. d. $s$ (cm)                             6.15 ± 0.015       6.04 ± 0.014       − 11 ± 2.1
Semi-interquartile range (cm)              4.05 ± 0.017       4.02 ± 0.016       − 3 ± 2.3
Coefficient of variation $100\,s/\bar{x}$  3.58 ± 0.0090      3.50 ± 0.0083      − 8 ± 1.2

These results agree well with further available data from Swedish
conscription measurements. During the last 100 years, the mean
stature of the conscripts has steadily increased, while the s. d. has
decreased.

According to Table 31.3.5, the increase of the mean stature for
the observed samples during the period of 9 years amounts to
0.78 ± 0.030 cm. What kind of conclusions can we draw from this
fact with respect to the unknown increase $\Delta m$ of the population
mean m? We have, in fact, observed the value 0.78 cm of a variable
which is approximately normally distributed, with the unknown
mean $\Delta m$, and a s. d. approximately equal to 0.030 cm. Let us, for
the sake of the argument, assume that the word »approximately» may
be omitted in both places, and let as usual $\lambda_p$ denote the p % value
of a normal deviate (cf 17.2). Consider the hypothesis that $\Delta m$ is
equal to a given quantity c. If we are working on a p % level, this
hypothesis will evidently be regarded as consistent with the data if
c is situated between the limits $0.78 \pm 0.030\,\lambda_p$, while otherwise it
will be rejected. The quantities $0.78 \pm 0.030\,\lambda_p$ are called the p %
confidence limits for $\Delta m$, and the interval between these limits is
the p % confidence interval. We shall return to these concepts in Ch. 34.
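A minimal sketch of these confidence limits, under the same simplifying assumption that the estimate is exactly normal with known s. d. The function name is illustrative; $\lambda_p$ is obtained from `statistics.NormalDist` instead of a table.

```python
from statistics import NormalDist

def normal_confidence_limits(estimate, se, p_percent):
    """Limits estimate -/+ lambda_p * se, where lambda_p is the two-sided
    p % value of a normal (0, 1) deviate."""
    lam = NormalDist().inv_cdf(1 - p_percent / 200)
    return estimate - lam * se, estimate + lam * se

# Increase of the mean stature, 0.78 cm with standard error 0.030 cm.
limits_5 = normal_confidence_limits(0.78, 0.030, 5)
limits_1 = normal_confidence_limits(0.78, 0.030, 1)
```

On the 5 % level this gives limits of roughly 0.72 and 0.84 cm; a hypothetical value c outside this interval would be rejected on that level.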


Ex. 3. The occurrence of exceptionally high or low water levels
in lakes or rivers is often of great practical importance. For the
average water levels of Lake Vänern in the month of June of the
n = 124 years 1807–1930, we have (data from Lindquist, Ref. 149)
the mean $\bar{x}$ = 4454.6 cm above sea level, and the s. d. s = 48.61 cm.
The distribution agrees well with the normal curve. Grouping the
original data (which are not given here) into 9 groups with the
class-breadth h = 20 cm, we find $\chi^2$ = 3.728. For 9 − 2 − 1 = 6 d. of fr.
this gives P = 0.71, so that the fit is very good.
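For an even number of degrees of freedom the upper tail of the $\chi^2$-distribution has a closed form, so that the value of P can be checked without tables. The sketch below uses that standard identity; the function name is illustrative.

```python
import math

def chi2_upper_tail_even(x, dof):
    """P(chi-square > x) for an even number of degrees of freedom:
    exp(-x/2) * sum_{k < dof/2} (x/2)^k / k!."""
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    half = x / 2
    term = 1.0
    total = 1.0
    for k in range(1, dof // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

p_fit = chi2_upper_tail_even(3.728, 6)
```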

If we denote by $x_\nu$ the ν:th value from the top in a normal
sample of n values, while $y_\nu$ is the ν:th value from the bottom, the
mean and the s. d. of $x_\nu$ are given by (28.6.16), while the corresponding
expressions for $y_\nu$ are obtained by obvious modifications. Replacing
in these expressions the population parameters m and $\sigma$ by the sample
values $\bar{x}$ and s given above, and neglecting the error terms, we obtain
the means and standard errors given in Table 31.3.6, which also shows
the extreme June levels actually observed during the period.



Table 31.3.6.

Extreme water levels of Lake Vänern, June 1807–1930.

ν    $x_\nu$     $E(x_\nu)$   $D(x_\nu)$   Diff. in units     $y_\nu$     $E(y_\nu)$   $D(y_\nu)$   Diff. in units
     observed    approx.      approx.      of stand. error    observed    approx.      approx.      of stand. error
1    4566        4582.1       20.04        − 0.80             4350        4326.9       20.04        + 1.15
2    4548        4566.5       12.55        − 1.47             4356        4342.6       12.55        + 1.07
3    4546        4558.7       9.82         − 1.29             4360        4350.4       9.82         + 0.98
4    4535        4553.4       8.32         − 2.21             4366        4355.6       8.32         + 1.25
5    4535        4549.5       7.35         − 1.97             4366        4359.5       7.35         + 0.88


The absolute magnitude of the differences between the observed
values and their means is in no case greater than might well be due
to random fluctuations. We observe, however, that all the $x_\nu$ lie below
their means, and conversely for the $y_\nu$. This is partly due to the
correlation between the $x_\nu$ (and the $y_\nu$), and partly to the fact that
the approximate mean values are affected with considerable errors,
since we are dealing with the comparatively low value n = 124.

If we may assume that the distribution will remain unaltered for
a period of, say, 500 years, we obtain in the same way as above the
mean 4603.5 cm, and the standard error 17.6 cm, for the upper
extreme level $x_1$ during this period. It would thus seem highly
improbable that a level exceeding 4603.5 + 4 · 17.6 = 4673.9 cm will occur
during this period.

Ex. 4. From Student's classical paper (Ref. 221) on the t-distribution,
we quote the figures given in Table 31.3.7. It is required to
test whether there is any significant difference between the effects of
the drugs A and B. If we assume that the difference between the
gains in sleep effected by the two drugs is normally distributed, the
last column of the table constitutes a sample of n = 10 values from
a normal population. On the usual null hypothesis that there is no
difference between the effects, the mean m of this population is zero.
If this hypothesis is true, the Student ratio $t = \sqrt{n-1}\,\bar{z}/s$ is
distributed in the t-distribution with 9 d. of fr. (cf 31.2). From the
observed values, we find t = 4.06, which by Table 4 corresponds to a value
of P between 0.01 and 0.001. Thus the deviation from zero is significant,
and the null hypothesis is disproved.
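The value t = 4.06 can be reproduced from the paired figures. The sketch below takes the ten pairs of Table 31.3.7 (Student's sleep data) and forms the Student ratio of the differences, with s computed using divisor n as in 31.2; the variable and function names are illustrative.

```python
import math

# Hours of sleep gained, drug A (x) and drug B (y), for the ten patients.
x = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]
y = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]

def paired_student_ratio(first, second):
    """t = sqrt(n-1) * mean(z) / s for the differences z = first - second,
    with s computed using divisor n."""
    z = [a - b for a, b in zip(first, second)]
    n = len(z)
    mean = sum(z) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in z) / n)
    return math.sqrt(n - 1) * mean / s

t_paired = paired_student_ratio(x, y)
```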











Table 31.3.7.

Additional hours of sleep gained by ten patients through the use of
two soporific drugs A and B.

Patient    Drug A      Drug B      Difference
           x           y           z = x − y
1          1.9         0.7         1.2
2          0.8         −1.6        2.4
3          1.1         −0.2        1.3
4          0.1         −1.2        1.3
5          −0.1        −0.1        0.0
6          4.4         3.4         1.0
7          5.5         3.7         1.8
8          1.6         0.8         0.8
9          4.6         0.0         4.6
10         3.4         2.0         1.4

           $\bar{x}$ = 2.33     $\bar{y}$ = 0.75     $\bar{z}$ = 1.58
           $s_1$ = 1.899        $s_2$ = 1.697        $s$ = 1.167


In this case, where we have the low value n = 10, it is to be
expected that the approximate test based on the standard error of
$\bar{z}$ will not give a very accurate result. If we apply this test, and use
the estimate $s/\sqrt{10-1}$ for the standard error, we are led to regard
the same value as above, $\sqrt{9}\,(\bar{z} - 0)/s$ = 4.06, as an observed value
of a variable which, on the null hypothesis, is normal (0, 1). By Table
2, this corresponds to P < 0.0001. If we compare this with the value
of P given by the exact test, it is seen that the error involved in
applying the approximate test tends to exaggerate the significance of
the deviation.

If, in the experiments recorded in Table 31.3.7, two different sets
of ten patients had been used to test the two drugs, the data might
also have been treated in another way (cf R. A. Fisher, Ref. 13,
p. 123–125). Suppose that for each drug the gain in sleep is normally
distributed, the s. d. having the same value in both cases. The samples
headed x and y are then independent samples from normal populations
with the same $\sigma$, and it is required to test the null hypothesis
that the two population means $m_1$ and $m_2$ are equal. The variable u
defined by (31.2.1), where we have to take $n_1 = n_2 = 10$, then has the
t-distribution with 18 d. of fr., and from Table 31.3.7 we find u = 1.86,
which corresponds to P = 0.08, so that in this way we do not find
any significant difference between the effects.
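The statistic (31.2.1) can be evaluated in the same way. This sketch again uses the figures of Table 31.3.7, now treated as two independent samples, with the variances computed using divisor n; names are illustrative.

```python
import math

x = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]
y = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]

def mean_and_variance(sample):
    """Sample mean and variance with divisor n."""
    n = len(sample)
    mean = sum(sample) / n
    return mean, sum((v - mean) ** 2 for v in sample) / n

def two_sample_u(first, second):
    """The variable u of (31.2.1) for two independent samples."""
    n1, n2 = len(first), len(second)
    m1, v1 = mean_and_variance(first)
    m2, v2 = mean_and_variance(second)
    factor = math.sqrt(n1 * n2 * (n1 + n2 - 2) / (n1 + n2))
    return factor * (m1 - m2) / math.sqrt(n1 * v1 + n2 * v2)

u_value = two_sample_u(x, y)
```

The same data thus give a much smaller ratio when the pairing of the observations is ignored.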

In cases where we may assume that the x and y columns are
independent, both the above methods are available, and if either test
shows a clearly significant difference, we must regard the null
hypothesis as disproved, even if the other test fails to detect any
significant difference. In the case actually before us in Table 31.3.7
there is, however, an obvious correlation between the x and y columns
due to the fact that corresponding figures refer to the same patient,
so that it is not legitimate to apply the second method.

Ex. 5. For the July temperatures in Stockholm for the n = 100
years 1841–1940, we have (cf Table 30.4.2) the mean $\bar{x}$ = 16.982
and the s. d. s = 1.6146. For the 30 first and the 30 last years of the
period, the means are respectively 16.893 and 17.463. Are these group
means significantly different from the general mean 16.982?

From the expression (29.4.5), we obtain t = −0.36 for the k = 30
first years, and t = 1.97 for the 30 last years, in both cases with
n − 2 = 98 d. of fr. Both values lie below the 5 % limit, so that this
test does not indicate any significant change in the summer temperature
during the century.

Fig. 31. Prices of potatoes at 46 places in Sweden, December 1936 (x) and
December 1937 (y). Regression lines and orthogonal regression line.

Ex. 6. Fig. 31 shows the distribution of the prices of potatoes (öre
per 100 kg) in December 1936 (x) and December 1937 (y), at n = 46
places in Sweden, according to official statistics. The ordinary
characteristics of the sample are

$\bar{x}$ = 660.67, $\bar{y}$ = 732.69, $s_1$ = 106.86, $s_2$ = 120.91,

$r$ = 0.7928, $b_{12}$ = 0.7007, $b_{21}$ = 0.8971.

Let us assume that the (x, y)-values form a sample from a normal
population, and that we wish to obtain information about the unknown
values of the regression coefficient $\beta_{21}$ and the correlation
coefficient $\rho$ of this population.

According to (29.8.4), the variable

$t = \frac{s_1 \sqrt{n-2}}{s_2 \sqrt{1-r^2}}\,(b_{21} - \beta_{21})$

has Student's distribution with n − 2 d. of fr. Introducing the values of
the sample characteristics given above, we may thus test the hypothesis
that $\beta_{21}$ is equal to any given quantity c. If we are working on a
p % level, this hypothesis will be regarded as consistent with the
data if c is situated between the limits

$b_{21} \pm t_p\,\frac{s_2 \sqrt{1-r^2}}{s_1 \sqrt{n-2}}$,

where $t_p$ denotes the p % value of t for n − 2 d. of fr., while otherwise
the hypothesis will be rejected. These limits are the p % confidence
limits for $\beta_{21}$ (cf Ex. 2 above). In the actual case we obtain
in this way the following confidence limits for $\beta_{21}$:

p = 5 % . . . . . 0.687 and 1.107,

p = 1 % . . . . . 0.617 and 1.177,

p = 0.1 % . . . . 0.530 and 1.264.
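A sketch of the computation of these limits follows. The value of $t_p$ must be supplied from a table of the t-distribution; the figure 2.692 used below for the 1 % level with 44 d. of fr. is an assumed table value, not taken from the text, and the function name is illustrative.

```python
import math

def regression_coefficient_limits(b21, s1, s2, r, n, t_p):
    """Confidence limits b21 -/+ t_p * s2*sqrt(1-r^2) / (s1*sqrt(n-2))."""
    half_width = t_p * s2 * math.sqrt(1 - r * r) / (s1 * math.sqrt(n - 2))
    return b21 - half_width, b21 + half_width

# Sample characteristics of Ex. 6; t_p = 2.692 is an assumed 1 % value for 44 d. of fr.
limits_1_percent = regression_coefficient_limits(0.8971, 106.86, 120.91, 0.7928, 46, 2.692)
```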

For the sample correlation coefficient r = 0.7928, we have by (27.8.1)
and (27.8.2) approximately the mean $\rho$ and the standard error

$d(r) = (1 - r^2)/\sqrt{n} = 0.0548$.

If the sampling distribution of r shows a sufficiently close approach
to normality, this may be used to test the hypothesis that $\rho$ is equal
to any given quantity. However, the sampling distribution of r
tends rather slowly to normality, when $\rho$ differs considerably from
zero, and for n = 46 it must be expected that the results obtained
by the use of the standard error are not very accurate. It is thus
preferable to use the exact tables of the r-distribution (David, Ref.
261) or the logarithmic transformation (29.7.3)–(29.7.4) due to R. A.
Fisher. In the latter case, we have to regard
$z = \frac{1}{2} \log \frac{1+r}{1-r}$ as normally distributed, with the mean
$\frac{1}{2} \log \frac{1+\rho}{1-\rho} + \frac{\rho}{2(n-1)}$ and the s. d.
$1/\sqrt{n-3}$, so that the variable

$\sqrt{n-3}\,\left( z - \frac{1}{2} \log \frac{1+\rho}{1-\rho} - \frac{\rho}{2(n-1)} \right)$

is normal (0, 1). Working on a p % level, we are thus led to regard
the data as consistent with any hypothetical value of $\rho$ if

$\frac{1}{2} \log \frac{1+\rho}{1-\rho} + \frac{\rho}{2(n-1)}$

falls between the limits

$\frac{1}{2} \log \frac{1+r}{1-r} \pm \frac{\lambda_p}{\sqrt{n-3}}$,

where $\lambda_p$ is the p % value of a normal deviate, while otherwise the
hypothetical value will be rejected. When r is known, these limits
may be calculated for any p, and the corresponding values of $\rho$ are
then obtained by the numerical solution of an equation of the form
$\frac{1}{2} \log \frac{1+\rho}{1-\rho} + \frac{\rho}{2(n-1)} = k$. These values
are the p % confidence limits for $\rho$. In the actual case, we obtain
the following confidence limits for $\rho$:

p = 5 % . . . . . 0.6486 and 0.8783,

p = 1 % . . . . . 0.5913 and 0.8980,

p = 0.1 % . . . . 0.5164 and 0.9171.
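The numerical solution mentioned above can be sketched by bisection, since the left-hand side of the equation increases steadily with $\rho$. Function names and the number of iterations are illustrative choices.

```python
import math
from statistics import NormalDist

def fisher_z_confidence_limits(r, n, p_percent):
    """Confidence limits for rho from the logarithmic transformation,
    including the term rho / (2(n-1)) in the mean of z."""
    lam = NormalDist().inv_cdf(1 - p_percent / 200)
    z = math.atanh(r)                       # 0.5 * log((1+r)/(1-r))
    half_width = lam / math.sqrt(n - 3)

    def mean_of_z(rho):
        return math.atanh(rho) + rho / (2 * (n - 1))

    def solve(target):
        lo, hi = -0.999999, 0.999999        # mean_of_z is increasing in rho
        for _ in range(200):
            mid = (lo + hi) / 2
            if mean_of_z(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    return solve(z - half_width), solve(z + half_width)

limits_5_percent = fisher_z_confidence_limits(0.7928, 46, 5)
```

Unlike limits of the form r ± const · d(r), the resulting interval is not symmetric about r = 0.7928.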

Ex. 7. Table 31.3.8 gives the values (taken from official records)
for the n = 30 years 1913–1942 of the following four variables:

$x_1$ = average yield of wheat (autumn sown) in kg per $10^4$ m² for
20 rural parishes in the district of Kalmar (Sweden).


Table 31.3.8.

Wheat yield, temperature and rainfall in the Kalmar district.

Year    Wheat yield    Winter temp.    Summer temp.    Rainfall    Best linear
        $x_1$          $x_2$           $x_3$           $x_4$       estimate of $x_1$
1913    1990           2.7             12.8            230         2126
1914    1950           3.1             13.7            268         2205
1915    1630           1.9             12.0            188         1899
1916    1720           1.8             11.7            316         2068
1917    1660           1.0             12.7            180         1794
1918    1680           1.6             12.0            261         2004
1919    1980           2.8             12.2            216         2017
1920    2180           1.7             12.8            346         2223
1921    2370           3.1             13.1            131         1996
1922    1790           1.1             11.8            266         1918
1923    2400           1.6             11.2            327         2100
1924    1410           0.1             11.8            320         1913
1925    2670           3.7             13.2            382         2580
1926    2180           1.1             12.5            279         1996
1927    2160           2.6             12.2            361         2313
1928    2530           0.8             10.5            324         [missing]
1929    2100           0.8             10.9            196         1718
1930    2330           3.6             12.4            381         2629
1931    1850           1.6             10.7            273         1970
1932    2230           1.9             12.5            289         2123
1933    2610           2.2             11.9            338         2234
1934    2600           3.0             13.5            267         2271
1935    2430           3.2             12.8            372         2463
1936    1940           2.8             12.8            367         2370
1937    2770           2.1             13.6            868         2882
1938    2670           3.8             12.t            202         2164
1939    2610           3.8             13.4            311         2401
1940    1420           −1.1            11.3            172         1434
1941    810            −0.4            11.8            194         1672
1942    [illegible]    −2.4            11.2            261         1434


$x_2$ = mean Celsius temperature of the air at Kalmar during the
preceding winter (October–March).
$x_3$ = mean Celsius temperature of the air at Kalmar during the
actual vegetation period (April–September).
$x_4$ = total rainfall in mm during the vegetation period, average
for three meteorological stations in the district.

In this case it seems reasonable to regard the variables $x_2$, $x_3$ and
$x_4$ as causes, each of which contributes more or less to the value of
the yield $x_1$. It is required to investigate the nature of the causal
relations between the variables. When the data are so few as in this
example, we cannot hope to reach very precise results, but have to
be satisfied with some general indications with respect to the
significance or non-significance of the various possible influences.

We shall assume that the joint distribution of the four variables
is normal. The correlation matrix $\{r_{ij}\}$ of the sample is

$$\begin{pmatrix}
1 & 0.59107 & 0.41082 & 0.46120 \\
0.59107 & 1 & 0.67028 & 0.31838 \\
0.41082 & 0.67028 & 1 & 0.10720 \\
0.46120 & 0.31838 & 0.10720 & 1
\end{pmatrix}$$

The determinant $R = |r_{ij}|$ is the square of the scatter coefficient
(cf 22.7) of the sample. If the $x_i$ are independent, we have by (29.13.2)
E(R) = 0.806 and D(R) approximately = 0.116. From the above matrix,
we actually find R = 0.273, so that a dependence between the variables
is clearly indicated.
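The determinant can be checked by ordinary elimination. In the sketch below, the expression used for E(R) under independence is an assumption of this illustration (the product of the factors 1 − i/(n − 1) for i = 1, ..., k − 1); it reproduces the quoted value 0.806 but is not copied from (29.13.2).

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    size = len(a)
    det = 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda row: abs(a[row][col]))
        if a[pivot][col] == 0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for k in range(col, size):
                a[row][k] -= factor * a[col][k]
    return det

def expected_determinant_independent(n, k):
    """Assumed expectation of the correlation determinant for k independent
    normal variables and n observations."""
    product = 1.0
    for i in range(1, k):
        product *= 1 - i / (n - 1)
    return product

R_MATRIX = [
    [1.0, 0.59107, 0.41082, 0.46120],
    [0.59107, 1.0, 0.67028, 0.31838],
    [0.41082, 0.67028, 1.0, 0.10720],
    [0.46120, 0.31838, 0.10720, 1.0],
]
observed = determinant(R_MATRIX)
expected = expected_determinant_independent(30, 4)
```

The observed determinant lies more than four times the quoted D(R) = 0.116 below the expected value.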

The significance of the various $r_{ij}$ may be judged by means of the
distribution (29.7.5), which holds for $r_{ij}$ if $x_i$ and $x_j$ are
independent. According to (29.7.6), the hypothesis that $x_i$ and $x_j$ are
independent will be disproved on the p % level, if $|r_{ij}|$ exceeds the
limit $t_p/\sqrt{t_p^2 + \nu}$, where $t_p$ is the p % value of t for
$\nu = n - 2$ d. of fr. A table of this limit for various values of n and p
is given by Fisher and Yates (Ref. 262). For the usual 5 %, 1 % and 0.1 %
levels, the values of the limit are

D. of fr.     p = 5 %     p = 1 %     p = 0.1 %
ν = 26        0.3740      0.4785      0.5880
ν = 27        0.3673      0.4705      0.5790
ν = 28        0.3609      0.4629      0.5703
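The limit $t_p/\sqrt{t_p^2 + \nu}$ is easily evaluated once $t_p$ is known. In this sketch the values of $t_p$ for 28 d. of fr. (2.048, 2.763 and 3.674) are assumed table values of the t-distribution, not taken from the text; the function name is illustrative.

```python
import math

def correlation_significance_limit(t_p, nu):
    """Limit t_p / sqrt(t_p^2 + nu) which |r| must exceed on the p % level."""
    return t_p / math.sqrt(t_p * t_p + nu)

# Assumed p % values of t for 28 d. of fr. (5 %, 1 % and 0.1 % levels).
limits_28 = [correlation_significance_limit(t, 28) for t in (2.048, 2.763, 3.674)]
```

The relation can be inverted: a correlation r equal to the limit corresponds to $t = r\sqrt{\nu}/\sqrt{1-r^2} = t_p$.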

For our $r_{ij}$ we have $\nu = n - 2 = 28$ d. of fr., so that all $r_{ij}$
except $r_{24}$ and $r_{34}$ exceed the 5 % limit. $r_{13}$ lies between the
5 % and 1 % limits, and $r_{14}$ is almost equal to the 1 % limit, while
$r_{12}$ and $r_{23}$ even exceed the 0.1 % limit. It is interesting to note
that $r_{12}$ is considerably larger than $r_{13}$, which seems to indicate
that the temperature of the last winter has a greater influence on the
yield than the temperature of the summer.

The partial correlation coefficients $r_{ij \cdot k}$ may be calculated from
(23.4.3), and we find the following values:

$r_{12 \cdot 3}$ = 0.4666, $r_{13 \cdot 2}$ = 0.0244, $r_{14 \cdot 2}$ = 0.3570,

$r_{12 \cdot 4}$ = 0.5281, $r_{13 \cdot 4}$ = 0.4096, $r_{14 \cdot 3}$ = 0.4602.

For the significance limits of the $r_{ij \cdot k}$, we have by (29.13.5) an
expression of the same form as for the $r_{ij}$, with $\nu = n - 3 = 27$ d. of
fr. Among the six coefficients given above, it is thus only
$r_{12 \cdot 4}$ that exceeds the 1 % limit, though both $r_{12 \cdot 3}$ and
$r_{14 \cdot 3}$ lie very close to this value. If we compare e. g.
$r_{13}$ = 0.41082 with the values given for $r_{13 \cdot 2}$ and
$r_{13 \cdot 4}$, we find that the elimination of the influence of the
winter temperature $x_2$ has reduced the correlation between the yield
$x_1$ and the summer temperature $x_3$ to the completely insignificant value
$r_{13 \cdot 2}$ = 0.0244, while the elimination of the rainfall $x_4$ has
practically no effect on the correlation. On the other hand, the comparison
between $r_{12}$ = 0.59107 and $r_{12 \cdot 3}$ or $r_{12 \cdot 4}$ shows that
the correlation between yield and winter temperature is not substantially
reduced by the elimination of summer temperature or rainfall. With respect
to $r_{14}$, the situation is much the same as for $r_{12}$. These comparisons
seem to suggest the conjecture that the winter temperature $x_2$ and
the rainfall $x_4$ are the really important factors, while the influence
of the summer temperature $x_3$ is mainly due to the fact that $x_3$ is
rather strongly correlated with $x_2$ ($r_{23}$ = 0.67028).

The partial correlation coefficients with two secondary subscripts
are calculated from (23.4.4). We find

$r_{12 \cdot 34}$ = 0.3739, $r_{13 \cdot 24}$ = 0.0848, $r_{14 \cdot 23}$ = 0.3650,

and these values seem to support the above conjecture, though none
of them is strictly significant. We have here $\nu = n - 4 = 26$ d. of fr.,
and the 5 % significance limit for $r_{ij \cdot kl}$ is 0.3740.
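Partial correlation coefficients of any order can be obtained from the correlation matrix by the usual recursion, eliminating one variable at a time. The sketch below is one way of doing this (function and variable names are illustrative; indices are 0-based, so the yield $x_1$ has index 0).

```python
import math

R_MATRIX = [
    [1.0, 0.59107, 0.41082, 0.46120],
    [0.59107, 1.0, 0.67028, 0.31838],
    [0.41082, 0.67028, 1.0, 0.10720],
    [0.46120, 0.31838, 0.10720, 1.0],
]

def partial_correlation(r, i, j, eliminated=()):
    """Correlation between variables i and j after eliminating the variables
    listed in `eliminated`, computed recursively."""
    if not eliminated:
        return r[i][j]
    k, rest = eliminated[-1], tuple(eliminated[:-1])
    r_ij = partial_correlation(r, i, j, rest)
    r_ik = partial_correlation(r, i, k, rest)
    r_jk = partial_correlation(r, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

r12_3 = partial_correlation(R_MATRIX, 0, 1, (2,))      # yield, winter temp.; summer temp. eliminated
r13_2 = partial_correlation(R_MATRIX, 0, 2, (1,))      # yield, summer temp.; winter temp. eliminated
r12_34 = partial_correlation(R_MATRIX, 0, 1, (2, 3))   # second-order coefficient
```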

Consider now the multiple correlation coefficients. By means of
(23.5.3) we find

$r_{1(23)}$ = 0.5914, $r_{1(24)}$ = 0.6575, $r_{1(34)}$ = 0.5872,
$r_{1(234)}$ = 0.6606.

The comparison between $r_{12}$ = 0.5911 and $r_{1(23)}$ = 0.5914 confirms the
results already obtained, since it shows that the knowledge of $x_3$
adds practically nothing to our information with respect to the yield
when we already know $x_2$. Similarly, the multiple correlation
coefficient $r_{1(24)}$ is not appreciably smaller than $r_{1(234)}$.

If the variables $x_1, \ldots, x_4$ are independent, the product
$n\,r^2_{1(2 \ldots k)}$ is by (29.13.9) for large n approximately
distributed in a $\chi^2$-distribution with k − 1 d. of fr. In the actual
case, we find $n\,r^2_{1(34)}$ = 10.344 with 2 d. of fr., and
$n\,r^2_{1(234)}$ = 13.092 with 3 d. of fr. Since $r_{1(23)}$ and
$r_{1(24)}$ are both greater than $r_{1(34)}$, it is thus seen that all four
multiple correlation coefficients given above are significantly greater
than zero.
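For two explanatory variables the multiple correlation coefficient has a simple closed form in terms of the total correlations. The sketch below uses that standard identity (it is not claimed to be the expression numbered (23.5.3)), with the coefficients of the sample correlation matrix; names are illustrative.

```python
import math

def multiple_correlation_two(r1a, r1b, rab):
    """Multiple correlation of variable 1 on variables a and b:
    R^2 = (r1a^2 + r1b^2 - 2*r1a*r1b*rab) / (1 - rab^2)."""
    r_squared = (r1a ** 2 + r1b ** 2 - 2 * r1a * r1b * rab) / (1 - rab ** 2)
    return math.sqrt(r_squared)

r12, r13, r14 = 0.59107, 0.41082, 0.46120
r23, r24, r34 = 0.67028, 0.31838, 0.10720

r1_23 = multiple_correlation_two(r12, r13, r23)
r1_24 = multiple_correlation_two(r12, r14, r24)
r1_34 = multiple_correlation_two(r13, r14, r34)
n_r_squared_34 = 30 * r1_34 ** 2   # compared with chi-square for 2 d. of fr.
```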

Finally, we find the partial regression coefficients

$b_{12 \cdot 34}$ = 133.66, corresponding to t = 2.055,

$b_{13 \cdot 24}$ = 44.87, corresponding to t = 0.434,

$b_{14 \cdot 23}$ = 1.9963, corresponding to t = 1.999,

where the t-values are calculated from (29.12.1), under the hypothesis
that the corresponding population values $\beta_{1i \cdot jk}$ are zero. We
have 26 d. of fr. for t, and thus by Table 4 none of the three values is
significant, though $b_{12 \cdot 34}$ and $b_{14 \cdot 23}$ are very near the
5 % limit. If we identify the observed b-values with the unknown population
values, this would mean e. g. that an increase of one degree in the mean
winter temperature would on the average produce an increase of
about 134 kg in the yield per $10^4$ m², summer temperature and rainfall
being equal, whereas the corresponding figure for an increase of
one degree in the summer temperature would only amount to 45 kg.

The equation of the sample regression plane for $x_1$ gives the best
linear estimate of the observed values of $x_1$ in terms of $x_2$, $x_3$ and
$x_4$:

$x_1^* = 133.66\,x_2 + 44.87\,x_3 + 1.9963\,x_4 + 730.9$.

The values of $x_1^*$ calculated from this expression are given in the
last column of Table 31.3.8. The values of $x_1$ and $x_1^*$ are also shown
in Fig. 32.

Fig. 32. Wheat yield $x_1$ and best linear estimate $x_1^*$.


It should be borne in mind that, in all tests treated above, we
have throughout assumed that we are concerned with samples obtained
by simple random sampling (cf 25.2). This implies, i. a., that the sample
values are supposed to be mutually independent. In many applications,
however, situations arise where this assumption cannot be legitimately
introduced. Cases of this character occur, e. g., often in connection
with the analysis of statistical time series. Unfortunately,
considerations of space have prevented the realization of the original plan
to include in the present work a chapter on this subject, based on the
mathematical theory of random processes. A discussion of the subject
will be found in the dissertation of Wold (Ref. 246 a).



Chapters 32–34. Theory of Estimation. 1)


CHAPTER 32. 

Classification of Estimates. 

32.1. The problem. In the preceding chapters, we have repeatedly
encountered the problem of estimating certain population parameters
by means of a set of sample values. We now proceed to a more
systematic investigation of this subject.

The theory of estimation was founded by R. A. Fisher in a series
of fundamental papers (Ref. 89, 96, 103, 104 and others). In Chs
32–33, we shall give an account of some of the main ideas introduced
by Fisher, completing his results on certain points. In the present
chapter, we shall be concerned with the classification and properties
of various kinds of estimates. We shall then in Ch. 33 turn to
consider some general methods of estimation, particularly the important
method of maximum likelihood due to R. A. Fisher. Finally, Ch. 34
will be devoted to an investigation of the possibility of using the
estimates for drawing valid inferences with respect to the parameter
values.

Suppose that we are given a sample from a population, the
distribution of which has a known mathematical form, but involves a certain
number of unknown parameters. There will then always be an infinite
number of functions of the sample values that might be proposed as
estimates of the parameters. The following question then arises: How
should we best use the data to form estimates? This question immediately
raises another: What do we mean by the »best» estimates?

We might be tempted to answer that, evidently, the best estimate 
is the estimate falling nearest to the true value of the parameter to 
be estimated. However, it must be borne in mind that every estimate 
is a function of the sample values, and is thus to be regarded as an 
observed value of a certain random variable. Consequently we have 

1) A considerable part of the topics treated in these chapters is highly
controversial, and the relative merits of the various concepts and methods
discussed here are subject to divided opinions in the literature.



no means of predicting the individual value assumed by the estimate
in a given particular case, so that the goodness of an estimate cannot
be judged from individual values, but only from the distribution of the
values which it will assume in the long run, i. e. from its sampling
distribution. When the great bulk of the mass in this distribution is
concentrated in some small neighbourhood of the true value, there is
a great probability that the estimate will only differ from the true
value by a small quantity. From this point of view, an estimate will
be »better» in the same measure as its sampling distribution shows a
greater concentration about the true value, and the above question may
be expressed in the following more precise form: How should we use
our data in order to obtain estimates of maximum concentration? — We
shall take this question as the starting-point of our investigation.

We have seen in Part II that the concentration (or the complementary
property: the dispersion) of a distribution may be measured
in various ways, and that the choice between various measures is to
a great extent arbitrary. The same arbitrariness will, of course, appear
in the choice between various estimates. Any measure of dispersion
corresponds to a definition of the »best» estimate, viz. the estimate
that renders the dispersion as expressed by this particular measure
as small as possible.

In the sequel, we shall exclusively consider the measures of dispersion
and concentration associated with the variance and its multi¬
dimensional generalizations. This choice is in the first place based 
on the general arguments in favour of the least-squares principle ad¬ 
vanced in 15.6. Further, in the important case when the sampling 
distributions of our estimates are at least approximately normal, any 
reasonable measure of concentration will be determined by the second 
order moments, so that in this particular case the choice will be 
unique. — For a discussion of the theory from certain other points 
of view, the reader may be referred to papers by Pitman (Ref. 198, 
199) and Geary (Ref. 116 a). 

It will be convenient to consider first the case of samples from a 
population, the distribution of which contains a single unknown parameter.
This case will be treated in 32.2 — 32.5, while 32.6 — 32.7 will
be devoted to questions involving several unknown parameters. An 
important generalization of the theory will be indicated in 32.8. 

32.2. Two lemmas. — We shall now prove two lemmas that will
be required in the sequel. Each lemma is concerned with one of the
two simple types of distributions, and there is a general proposition
of which both lemmas are particular cases. The general proposition
will, however, not be given here.

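Lemma 1. Suppose that, for every $\alpha$ belonging to a non-degenerate
interval $A$, the function $g(x;\alpha)$ is a fr. f. in $x$, having the first moment
$\psi(\alpha)$, and a finite second moment. Suppose further that, for almost
all $x$, the partial derivative $\partial g/\partial\alpha$ exists for every $\alpha$ in $A$, and that
$\left|\partial g/\partial\alpha\right| < G_0(x)$, where $G_0$ and $x\,G_0$ are integrable over $(-\infty, \infty)$.

Then the derivative $d\psi/d\alpha$ exists for every $\alpha$ in $A$, and we have

(32.2.1)    $$\int_{-\infty}^{\infty} (x-\alpha)^2\, g(x;\alpha)\,dx \cdot \int_{-\infty}^{\infty} \left(\frac{\partial \log g}{\partial\alpha}\right)^2 g(x;\alpha)\,dx \;\ge\; \left(\frac{d\psi}{d\alpha}\right)^2 .$$

The sign of equality holds here, for a given value of $\alpha$, when and only
when there exists a quantity $k$, which is independent of $x$ but may depend
on $\alpha$, such that

(32.2.2)    $$\frac{\partial \log g}{\partial\alpha} = k\,(x-\alpha)$$

for almost all $x$ satisfying $g(x;\alpha) > 0$.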

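By hypothesis we have for every $\alpha$ in $A$

(32.2.3)    $$\int_{-\infty}^{\infty} g(x;\alpha)\,dx = 1, \qquad \int_{-\infty}^{\infty} x\,g(x;\alpha)\,dx = \psi(\alpha),$$

and the conditions of 7.3 for differentiation under the integral sign
are satisfied for both integrals, so that $d\psi/d\alpha$ exists and is given by
the expression 1)

$$\frac{d\psi}{d\alpha} = \int_{-\infty}^{\infty} x\,\frac{\partial g}{\partial\alpha}\,dx = \int_{-\infty}^{\infty} (x-\alpha)\,\frac{\partial g}{\partial\alpha}\,dx = \int_{-\infty}^{\infty} (x-\alpha)\sqrt{g}\cdot\frac{\partial \log g}{\partial\alpha}\sqrt{g}\;dx .$$

1) If $g(x;\alpha) = 0$ for all $x$ in a certain interval, we must also have $\partial g/\partial\alpha = 0$, as
otherwise $g$ would assume negative values. The expression
$\frac{\partial \log g}{\partial\alpha}\sqrt{g} = \frac{1}{\sqrt{g}}\,\frac{\partial g}{\partial\alpha}$ should
then be given the value zero.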



The relation (32.2.1) then immediately follows by an application of
the Schwarz inequality (9.5.1). 1)

In (9.5.1) the sign of equality holds when and only when there
are two constants $u$ and $v$, not both equal to zero, such that
$u\,g(x) + v\,h(x) = 0$ for almost all (P) values of $x$. Since $(x-\alpha)\sqrt{g}$ cannot
vanish for almost all $x$ it follows that, for a given value of $\alpha$, the
sign of equality holds in (32.2.1) when and only when

$$\frac{\partial \log g}{\partial\alpha}\sqrt{g} = k\,(x-\alpha)\sqrt{g}$$

for almost all $x$, where $k$ is independent of $x$. This completes the
proof of the lemma.

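We give two examples of cases where the relation (32.2.2) is satisfied. Accordingly,
it will be easily verified that in both these cases the sign of equality holds in
(32.2.1).

Ex. 1. The normal distribution with mean $\alpha$ and constant s. d. Taking

$$g(x;\alpha) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\alpha)^2}{2\sigma^2}},$$

where $\sigma$ is independent of $x$ and $\alpha$, we have $\psi(\alpha) = \alpha$ and
$\frac{\partial \log g}{\partial\alpha} = \frac{x-\alpha}{\sigma^2}$ for all $x$ and $\alpha$.

Ex. 2. The $\chi^2$-distribution. By (18.1.6), the fr. f. $k_n(x)$ of the $\chi^2$-distribution
has the first moment $n$. Thus the fr. f. $g(x;\alpha) = \frac{n}{\alpha}\,k_n\!\left(\frac{nx}{\alpha}\right)$, where $\alpha > 0$, has the
first moment $\psi(\alpha) = \alpha$, and we obtain from (18.1.3)
$\frac{\partial \log g}{\partial\alpha} = \frac{n}{2\alpha^2}\,(x-\alpha)$ for all
$x > 0$ and $\alpha > 0$.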

Lemma 2. Suppose that, for every $\alpha$ belonging to a non-degenerate
interval $A$, the finite or enumerable sequence of functions $p_1(\alpha), p_2(\alpha), \ldots$
are the probabilities of a distribution of the discrete type, the corresponding
mass points $u_1, u_2, \ldots$ being independent of $\alpha$. Suppose further that the
distribution has the first moment $\psi(\alpha)$ and a finite second moment, and
that the derivatives $p_i'(\alpha)$ exist for all $i$ and for every $\alpha$ in $A$, and are
such that the series $\sum_i u_i\,p_i'(\alpha)$ converges absolutely and uniformly in $A$.

— Then the derivative $d\psi/d\alpha$ exists for every $\alpha$ in $A$, and we have

(32.2.4)    $$\sum_i (u_i-\alpha)^2\, p_i(\alpha) \cdot \sum_i \left(\frac{d \log p_i}{d\alpha}\right)^2 p_i(\alpha) \;\ge\; \left(\frac{d\psi}{d\alpha}\right)^2 .$$


1) I am indebted to professor L. Ahlfors for a remark leading to a simplification
of my original proof of (32.2.1).



The sign of equality holds here, for a given value of $\alpha$, when and only
when there exists a quantity $k$, which is independent of $i$ but may depend
on $\alpha$, such that

(32.2.5)    $$\frac{d \log p_i}{d\alpha} = k\,(u_i-\alpha)$$

for all $i$ satisfying $p_i(\alpha) > 0$.

This is strictly analogous to Lemma 1, and is proved in the same
way, by means of the following relations which correspond to (32.2.3):

$$\sum_i p_i(\alpha) = 1, \qquad \sum_i u_i\,p_i(\alpha) = \psi(\alpha).$$

As in the previous case, we give two examples of cases where the relation (32.2.5)
is satisfied. In both cases it will be easily verified that the sign of equality holds
in (32.2.4).

Ex. 3. For the binomial distribution with $p = \alpha/n$, we have $u_i = i$ and
$p_i = \binom{n}{i}(\alpha/n)^i (1-\alpha/n)^{n-i}$, where $i = 0, 1, \ldots, n$. Hence the mean is $\psi(\alpha) = \alpha$,
and we have

$$\frac{d \log p_i}{d\alpha} = \frac{i}{\alpha} - \frac{n-i}{n-\alpha} = \frac{n}{\alpha(n-\alpha)}\,(u_i-\alpha).$$

Ex. 4. When $n \to \infty$ while $\alpha$ remains fixed, the binomial distribution tends to
the Poisson distribution with $u_i = i$ and $p_i = \frac{\alpha^i}{i!}\,e^{-\alpha}$. Here we have $\psi(\alpha) = \alpha$ and

$$\frac{d \log p_i}{d\alpha} = \frac{u_i-\alpha}{\alpha}.$$
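The equality asserted in Ex. 3 and Ex. 4 can also be checked numerically. The following is a minimal sketch in Python, not part of the original exposition; the function names are our own, and the Poisson series is truncated after a finite number of mass points, which is an approximation.

```python
from math import comb, exp

def lemma2_members(masses, probs, dlog_probs, alpha, dpsi):
    """Return the first and second members of (32.2.4)."""
    spread = sum((u - alpha) ** 2 * p for u, p in zip(masses, probs))
    info = sum(d ** 2 * p for d, p in zip(dlog_probs, probs))
    return spread * info, dpsi ** 2

def binomial_members(n, alpha):
    """Ex. 3: binomial distribution with p = alpha / n, so psi(alpha) = alpha."""
    p = alpha / n
    masses = list(range(n + 1))
    probs = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in masses]
    dlog = [n * (u - alpha) / (alpha * (n - alpha)) for u in masses]
    return lemma2_members(masses, probs, dlog, alpha, dpsi=1.0)

def poisson_members(alpha, terms=150):
    """Ex. 4: Poisson distribution; the series is truncated after `terms` points."""
    masses = list(range(terms))
    probs, p = [], exp(-alpha)
    for i in masses:
        probs.append(p)
        p *= alpha / (i + 1)  # p_{i+1} = p_i * alpha / (i + 1)
    dlog = [(u - alpha) / alpha for u in masses]
    return lemma2_members(masses, probs, dlog, alpha, dpsi=1.0)
```

In both cases the two returned members should agree up to rounding error, since the relation (32.2.5) holds with $k = n/(\alpha(n-\alpha))$ and $k = 1/\alpha$ respectively.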

32.3. Minimum variance of an estimate. Efficient estimates. —
Suppose that, to every value of the parameter $\alpha$ belonging to a
non-degenerate interval $A$, there corresponds a certain d. f. $F(x;\alpha)$. Let
$x_1, \ldots, x_n$ be a sample of $n$ values from a population with the d. f.
$F(x;\alpha)$, where $\alpha$ may have any value in $A$, and let it be required to
estimate the unknown »true value» of $\alpha$. We shall use the general
notation $\alpha^* = \alpha^*(x_1, \ldots, x_n)$ for any function of the sample values 1)
proposed as an estimate of $\alpha$.

In the paragraphs 32.3 — 32.4, the size $n$ of the sample will be
considered as a fixed number $\ge 1$. In 32.5, we proceed to consider

1) It is important to observe the different signification of the symbols $\alpha^*$ and $\alpha$.
By definition, $\alpha^*$ is a function of the sample values $x_1, \ldots, x_n$, which are conceived
as random variables. Thus $\alpha^*$ is itself a random variable, possessing a certain
sampling distribution. On the other hand, $\alpha$ is a variable in the ordinary analytic
sense which, in the population corresponding to a given sample, may assume any
constant, though possibly unknown, value in $A$.




questions related to the asymptotic behaviour of our estimates when 
n is large. 

According to the terminology introduced in 27.6, $\alpha^*$ is called an
unbiased estimate of $\alpha$, if we have $E(\alpha^*) = \alpha$. As shown by some
simple examples in 27.6, it is often possible to remove the bias of an
estimate by applying a simple correction, so that an unbiased estimate
is obtained. In the general case, however, an estimate will have a
certain bias $b(\alpha)$ depending on $\alpha$, so that we have

$$E(\alpha^*) = \alpha + b(\alpha).$$


It can be shown that, subject to certain general conditions of regularity,
the mean square deviation $E(\alpha^*-\alpha)^2$ can never fall below a positive
limit depending only on the d. f. $F(x;\alpha)$, the size $n$ of the sample,
and the bias $b(\alpha)$. In the particular case when $\alpha^*$ is unbiased whatever
be the true value of $\alpha$ in $A$, the bias $b(\alpha)$ is identically zero, and it
follows that the variance $D^2(\alpha^*)$ can never fall below a certain limit
depending only on $F$ and $n$.

We shall restrict ourselves to proving this theorem for the case 
when the d. f. $F(x;\alpha)$ belongs to one of the two simple types.


1. The continuous type. — Consider a distribution of the continuous
type, with the fr. f. $f(x;\alpha)$, where $\alpha$ may have any value in $A$. The
values $x_1, \ldots, x_n$ obtained in $n$ independent drawings from this
distribution are independent random variables, all of which have the same
fr. f. $f(x;\alpha)$. Each particular sample will be represented by a definite
point $x = (x_1, \ldots, x_n)$ in the sample space $R_n$ of the variables $x_1, \ldots, x_n$,
and the probability element of the joint distribution is

$$L(x_1, \ldots, x_n;\alpha)\,dx_1 \ldots dx_n = f(x_1;\alpha) \ldots f(x_n;\alpha)\,dx_1 \ldots dx_n .$$

The joint fr. f. $L = f(x_1;\alpha) \ldots f(x_n;\alpha)$ is known as the likelihood
function of the sample (cf 33.2).

Let now $\alpha^* = \alpha^*(x_1, \ldots, x_n)$ be a unique function of $x_1, \ldots, x_n$,
not depending on $\alpha$, which is continuous and has continuous partial
derivatives $\partial\alpha^*/\partial x_i$ in all points $x$, except possibly in certain points
belonging to a finite number of hypersurfaces. We propose to use $\alpha^*$
as an estimate of $\alpha$, and suppose that $E(\alpha^*) = \alpha + b(\alpha)$, so that $b(\alpha)$
is the bias of $\alpha^*$.

The equation $\alpha^* = c$ will, for various values of $c$, define a family
of hypersurfaces in $R_n$, and a point in $R_n$ may be uniquely determined
by the value of $\alpha^*$ corresponding to the particular hypersurface
to which the point belongs, and by $n-1$ »local» coordinates
$\xi_1, \ldots, \xi_{n-1}$ which determine the position of the point on the hypersurface.
We may now consider the transformation by which the old variables
$x_1, \ldots, x_n$ are replaced by the new variables $\alpha^*$ and $\xi_1, \ldots, \xi_{n-1}$.
Choosing the »local» coordinates $\xi_i$ such that the transformation
satisfies the conditions A) and B) of 22.2, the joint fr. f. of the new
variables will then be

$$f(x_1;\alpha) \ldots f(x_n;\alpha)\,|J|,$$

where $J$ is the Jacobian of the transformation, and the $x_i$ have to
be replaced by their expressions in terms of the new variables.

The random variable $\alpha^*$ will have a certain distribution, in general
dependent on the parameter $\alpha$, and we denote the corresponding fr. f.
by $g(\alpha^*;\alpha)$. Further, the joint conditional distribution of $\xi_1, \ldots, \xi_{n-1}$,
corresponding to a given value of $\alpha^*$, will have a fr. f. which we
denote by $h(\xi_1, \ldots, \xi_{n-1} \mid \alpha^*;\alpha)$. By (22.1.1) we then have

(32.3.1)    $$f(x_1;\alpha) \ldots f(x_n;\alpha)\,|J| = g(\alpha^*;\alpha)\,h(\xi_1, \ldots, \xi_{n-1} \mid \alpha^*;\alpha),$$

and the transformation of the probability element according to (22.2.3)
may thus be written

(32.3.2)    $$f(x_1;\alpha) \ldots f(x_n;\alpha)\,dx_1 \ldots dx_n = g(\alpha^*;\alpha)\,h(\xi_1, \ldots, \xi_{n-1} \mid \alpha^*;\alpha)\,d\alpha^*\,d\xi_1 \ldots d\xi_{n-1} .$$

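Suppose now that, for almost all values of $x$, $\alpha^*$, $\xi_1, \ldots, \xi_{n-1}$, the
partial derivatives $\partial f/\partial\alpha$, $\partial g/\partial\alpha$ and $\partial h/\partial\alpha$ exist for every $\alpha$ in $A$, and that

$$\left|\frac{\partial f}{\partial\alpha}\right| < F_0(x), \qquad \left|\frac{\partial g}{\partial\alpha}\right| < G_0(\alpha^*), \qquad \left|\frac{\partial h}{\partial\alpha}\right| < H_0(\xi_1, \ldots, \xi_{n-1};\alpha^*),$$

where $F_0$, $G_0$, $\alpha^* G_0$ and $H_0$ are integrable over the whole space of
the variables $x$, $\alpha^*$, $\alpha^*$ and $\xi_1, \ldots, \xi_{n-1}$ respectively. We shall then say
that we are concerned with a regular estimation case of the continuous
type, and $\alpha^*$ will be called a regular estimate of $\alpha$. — We now proceed
to prove the following main theorem.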

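In any regular estimation case of the continuous type, the mean
square deviation of the estimate $\alpha^*$ from the true value $\alpha$ satisfies the
inequality

(32.3.3)    $$E(\alpha^*-\alpha)^2 \;\ge\; \frac{\left(1+\dfrac{db}{d\alpha}\right)^2}{n\displaystyle\int_{-\infty}^{\infty}\left(\frac{\partial \log f}{\partial\alpha}\right)^2 f(x;\alpha)\,dx} .$$

The sign of equality holds here, for every $\alpha$ in $A$, when and only when
the following two conditions are satisfied whenever $g(\alpha^*;\alpha) > 0$:

A) The fr. f. $h(\xi_1, \ldots, \xi_{n-1} \mid \alpha^*;\alpha)$ is independent of $\alpha$.

B) We have $\dfrac{\partial \log g}{\partial\alpha} = k\,(\alpha^*-\alpha)$, where $k$ is independent of $\alpha^*$ but
may depend on $\alpha$.

In the particular case when $\alpha^*$ is unbiased whatever be the value of
$\alpha$ in $A$, we have $b(\alpha) = 0$, and (32.3.3) reduces to

(32.3.3 a)    $$D^2(\alpha^*) \;\ge\; \frac{1}{n\displaystyle\int_{-\infty}^{\infty}\left(\frac{\partial \log f}{\partial\alpha}\right)^2 f\,dx} .$$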

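From our assumptions concerning the functions $f$ and $h$, it follows
according to 7.3 that the relations

$$\int_{-\infty}^{\infty} f(x;\alpha)\,dx = \int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty} h(\xi_1, \ldots, \xi_{n-1} \mid \alpha^*;\alpha)\,d\xi_1 \ldots d\xi_{n-1} = 1$$

may be differentiated with respect to $\alpha$ under the integrals. The resulting
relations may be written

(32.3.4)    $$\int_{-\infty}^{\infty} \frac{\partial \log f}{\partial\alpha}\,f(x;\alpha)\,dx = \int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty} \frac{\partial \log h}{\partial\alpha}\,h(\xi_1, \ldots, \xi_{n-1} \mid \alpha^*;\alpha)\,d\xi_1 \ldots d\xi_{n-1} = 0 .$$

Taking the logarithmic derivatives with respect to $\alpha$ on both sides
of (32.3.1) we obtain, the Jacobian $J$ being independent of $\alpha$,

(32.3.5)    $$\sum_{i=1}^{n} \frac{\partial \log f(x_i;\alpha)}{\partial\alpha} = \frac{\partial \log g}{\partial\alpha} + \frac{\partial \log h}{\partial\alpha} .$$

We now square both members of this relation, multiply by (32.3.2),
and integrate over the whole space. According to (32.3.4) all terms
involving products of two different derivatives vanish, and we obtain

(32.3.6)    $$n\int_{-\infty}^{\infty}\left(\frac{\partial \log f}{\partial\alpha}\right)^2 f\,dx = \int_{-\infty}^{\infty}\left(\frac{\partial \log g}{\partial\alpha}\right)^2 g\,d\alpha^* + \int_{-\infty}^{\infty} g\,d\alpha^* \int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty}\left(\frac{\partial \log h}{\partial\alpha}\right)^2 h\,d\xi_1 \ldots d\xi_{n-1} \;\ge\; \int_{-\infty}^{\infty}\left(\frac{\partial \log g}{\partial\alpha}\right)^2 g\,d\alpha^* .$$

The above proof of this inequality is due to Dugué (Ref. 76). The
sign of equality holds here when and only when $\partial h/\partial\alpha = 0$ in almost
all points such that $g > 0$, i. e. when the condition A) is satisfied.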

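Finally, the fr. f. $g(\alpha^*;\alpha)$ satisfies the conditions of Lemma 1 of
the preceding paragraph, with $\psi(\alpha) = \alpha + b(\alpha)$, and an application of
that lemma to the inequality (32.3.6) now immediately completes the
proof of the theorem.

The integral occurring in the denominators of the second members
of (32.3.3) and (32.3.3 a) may be expressed in any of the equivalent
forms

$$\int_{-\infty}^{\infty}\left(\frac{\partial \log f}{\partial\alpha}\right)^2 f\,dx = \int_{-\infty}^{\infty}\frac{1}{f}\left(\frac{\partial f}{\partial\alpha}\right)^2 dx = E\left(\frac{\partial \log f}{\partial\alpha}\right)^2 .$$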

It will be readily seen that the above theorem remains true when
we consider samples from a multidimensional population, specified by
a fr. f. $f(x_1, \ldots, x_k;\alpha)$ containing the unknown parameter $\alpha$.

Consider now the case when the estimate $\alpha^*$ is regular and unbiased.
The second member of (32.3.3 a) then represents the smallest
possible value of the variance $D^2(\alpha^*)$. The ratio between this minimum
value and the actual value of $D^2(\alpha^*)$ will be called the efficiency of
$\alpha^*$, and will be denoted by $e(\alpha^*)$. We then always have $0 \le e(\alpha^*) \le 1$.
When the sign of equality holds in (32.3.3 a), the variance $D^2(\alpha^*)$
attains its smallest possible value, and we have $e(\alpha^*) = 1$. In this
case we shall say that $\alpha^*$ is an efficient estimate 1). These concepts are
due to R. A. Fisher (Ref. 89, 96).

1) As a rule this term is used with reference to the behaviour of an estimate in
large samples, i. e. for infinitely increasing values of $n$. However, we shall here find
it convenient to distinguish between an efficient estimate, by which we mean an



It follows from the above theorem that a regular and unbiased
estimate is efficient, when and only when the conditions A) and B) are
satisfied. This becomes evident, if $e(\alpha^*)$ is written in the form

(32.3.7)    $$e(\alpha^*) = \frac{\operatorname{Min} D^2(\alpha^*)}{D^2(\alpha^*)} = \frac{E\left(\dfrac{\partial \log g}{\partial\alpha}\right)^2}{n\,E\left(\dfrac{\partial \log f}{\partial\alpha}\right)^2} \cdot \frac{1}{D^2(\alpha^*)\,E\left(\dfrac{\partial \log g}{\partial\alpha}\right)^2} .$$

Both factors in the last expression are $\le 1$, and the efficiency attains
its maximum value 1 when and only when both factors are $= 1$. The
first factor is $= 1$ when and only when the condition A) of the above
theorem is satisfied, while the second factor has the same relation to
condition B). — When an efficient estimate exists, it can always be
found by the method of maximum likelihood due to R. A. Fisher (cf 33.2).

Let now $\alpha_0^*$ be an efficient estimate, while $\alpha_1^*$ is any regular
unbiased estimate of efficiency $e > 0$. We shall show that the correlation
coefficient of $\alpha_0^*$ and $\alpha_1^*$ is $\rho(\alpha_0^*, \alpha_1^*) = \sqrt{e}$. In fact, the regular unbiased
estimate $\alpha^* = (1-k)\,\alpha_0^* + k\,\alpha_1^*$ has the variance

$$D^2(\alpha^*) = \left((1-k)^2 + \frac{2\rho\,k(1-k)}{\sqrt{e}} + \frac{k^2}{e}\right) D^2(\alpha_0^*) = \left(1 + 2k\,\frac{\rho-\sqrt{e}}{\sqrt{e}} + k^2\,\frac{1-2\rho\sqrt{e}+e}{e}\right) D^2(\alpha_0^*),$$

and if $\rho \ne \sqrt{e}$, the coefficient of $D^2(\alpha_0^*)$ can always be rendered $< 1$
by giving $k$ a sufficiently small positive or negative value. Then it
would follow that $D^2(\alpha^*) < D^2(\alpha_0^*)$, and the efficiency of $\alpha^*$ would be
$> 1$, which is impossible.
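The argument can be traced numerically. The sketch below, in Python and with names of our own choosing, evaluates the coefficient of $D^2(\alpha_0^*)$ as a function of $k$ and, when $\rho \ne \sqrt{e}$, returns the value of $k$ that minimizes it; this particular $k$ is one convenient choice, not the only one that renders the coefficient smaller than 1.

```python
from math import sqrt

def variance_ratio(k, rho, e):
    """Coefficient of D^2(alpha_0*) in the variance of (1 - k) alpha_0* + k alpha_1*."""
    return (1 - k) ** 2 + 2 * rho * k * (1 - k) / sqrt(e) + k ** 2 / e

def improving_k(rho, e):
    """Return a k for which variance_ratio < 1, or None when rho equals sqrt(e)."""
    slope = 2 * (rho - sqrt(e)) / sqrt(e)            # coefficient of k
    curvature = (1 - 2 * rho * sqrt(e) + e) / e      # coefficient of k^2
    if slope == 0:
        return None
    # ratio = 1 + slope * k + curvature * k^2 is smallest at this k
    return -slope / (2 * curvature)
```

With $e = 0.25$ and $\rho = 0.3$, for instance, `improving_k` returns a small positive $k$ for which the coefficient falls below 1, which would contradict the efficiency of $\alpha_0^*$; with $\rho = 0.5 = \sqrt{e}$ no such $k$ exists.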

In particular for $e = 1$ we have $\rho = 1$. Thus two efficient estimates
$\alpha_0^*$ and $\alpha_1^*$ have the same mean $\alpha$, the same variance, and the
correlation coefficient $\rho = 1$. It then follows from 21.7 that the total

estimate of minimum variance for a given finite size $n$ of the sample, and an
asymptotically efficient estimate (cf 32.5), which has the analogous property for samples
of infinitely increasing size. An efficient estimate exists only under rather restrictive
conditions (cf 32.4), whereas the existence of an asymptotically efficient estimate can
be proved as soon as certain general regularity conditions are satisfied (cf 33.3).



mass in the joint distribution of $\alpha_0^*$ and $\alpha_1^*$ is situated on the line
$\alpha_0^* = \alpha_1^*$. Thus two efficient estimates of the same parameter are »almost
always» equal.

We show in this paragraph several examples of efficient estimates (Ex. 1 — 2 for
the continuous case, Ex. 5 — 6 for the discrete case). It will be left to the reader
to verify that, in each case, the conditions A) and B) for efficient estimates are
satisfied. In order to do this — we talk here of the continuous case, but in the
discrete case everything is analogous — he will first have to find the fr. f. $g(\alpha^*;\alpha)$
of the estimate concerned, and then the examples given in 32.2 will directly provide
the verification of condition B). Further, a convenient set of auxiliary variables
$\xi_1, \ldots, \xi_{n-1}$ should be introduced, and the conditional fr. f. $h$ should be calculated
from (32.3.1); it then only remains to verify that $h$ is independent of $\alpha$. — In all
examples, except in Ex. 4, we are dealing with regular estimates only. The reader
should verify this in detail at least in some cases.


Ex. 1. The mean of a normal population. Writing

$$f(x;m) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-m)^2}{2\sigma^2}},$$

where $\alpha = m$ is the parameter to be estimated, while $\sigma$ is a known constant, we
may choose for $A$ any finite interval, and obtain

$$E\left(\frac{\partial \log f}{\partial m}\right)^2 = \int_{-\infty}^{\infty}\left(\frac{x-m}{\sigma^2}\right)^2 f\,dx = \frac{1}{\sigma^2} .$$

Consequently the variance of any regular unbiased estimate $m^*$ satisfies the inequality
$D^2(m^*) \ge \sigma^2/n$. For the particular estimate $m^* = \bar{x} = \sum x_i / n$ we have by 27.2
$E(\bar{x}) = m$ and $D^2(\bar{x}) = \sigma^2/n$, so that the mean is an efficient estimate of $m$.

Accordingly we have seen above that certain other possible estimates of $m$, such
as the sample median (cf 28.6), and the mean of the $\nu$:th values from the top and
from the bottom of the sample (cf 28.6.17) have a larger variance than $\bar{x}$.

It is instructive to consider various other functions of the sample values that
might be used as unbiased estimates of $m$; it will be found that the variance is
always at least equal to $\sigma^2/n$. We give here a simple example of this kind. Consider
a sample of $n = 3$ values from the normal distribution specified above, and let
the sample values be arranged in order of magnitude, $x_1 \le x_2 \le x_3$. It might then
be thought that the weighted mean

$$z = c\,x_1 + (1-2c)\,x_2 + c\,x_3$$

would, for some conveniently chosen value of $c$, be a »better» estimate of $m$ than
the simple arithmetic mean, which corresponds to $c = \tfrac{1}{3}$. We have, however, $E(z) = m$
and

$$D^2(z) = \frac{\sigma^2}{3} + \frac{\sigma^2}{3\pi}\left(2\pi - 3\sqrt{3}\right)(3c-1)^2,$$

so that the variance of $z$ attains its minimum precisely when $c = \tfrac{1}{3}$. — It will be
left as an exercise for the reader to prove this formula, and to verify that the
conditions for a regular estimate are satisfied in this case.
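The expression for the variance of the weighted mean $z$ lends itself to a rough check by simulation. The following Python sketch is our own illustration, not a proof: the number of trials and the seed are arbitrary assumptions, and the simulated value agrees with the formula only up to sampling error.

```python
import random
from math import pi, sqrt

def theoretical_variance(c, sigma=1.0):
    """Variance of z = c*x1 + (1 - 2c)*x2 + c*x3 for ordered normal samples of size 3."""
    return sigma ** 2 / 3 + sigma ** 2 * (2 * pi - 3 * sqrt(3)) / (3 * pi) * (3 * c - 1) ** 2

def simulated_variance(c, m=0.0, sigma=1.0, trials=100_000, seed=12345):
    """Monte Carlo estimate of the variance of z, using a fixed seed."""
    rng = random.Random(seed)
    total = total_sq = 0.0
    for _ in range(trials):
        x1, x2, x3 = sorted(rng.gauss(m, sigma) for _ in range(3))
        z = c * x1 + (1 - 2 * c) * x2 + c * x3
        total += z
        total_sq += z * z
    mean = total / trials
    return total_sq / trials - mean ** 2
```

Comparing `simulated_variance(c)` with `theoretical_variance(c)` for a few values of $c$, e. g. $c = 0.2$ or $c = 0.5$, shows the variance exceeding $\sigma^2/3$ whenever $c \ne \tfrac{1}{3}$.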



Ex. 2. The variance of a normal population. Writing

$$f(x;\sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-m)^2}{2\sigma^2}},$$

where $\alpha = \sigma^2$ is the parameter to be estimated, while $m$ is a known constant, we
may choose for $A$ any finite interval $a < \sigma^2 < b$ with $a > 0$, and obtain

$$E\left(\frac{\partial \log f}{\partial \sigma^2}\right)^2 = \int_{-\infty}^{\infty}\left(\frac{(x-m)^2}{2\sigma^4} - \frac{1}{2\sigma^2}\right)^2 f\,dx = \frac{1}{2\sigma^4} .$$

Consequently the variance of any regular unbiased estimate of $\sigma^2$ is at least equal to
$2\sigma^4/n$. Correcting the sample variance $s^2$ for bias (cf 27.6), we obtain the expression
$\frac{n}{n-1}\,s^2 = \frac{1}{n-1}\sum (x_i-\bar{x})^2$, which by (27.4.6) is an unbiased estimate of $\sigma^2$ with the
variance $2\sigma^4/(n-1)$. Obviously this is not an efficient estimate, but an estimate of
efficiency $(n-1)/n < 1$. On the other hand, consider the estimate $s_0^2 = \frac{1}{n}\sum (x_i-m)^2$.
This is legitimate, since $m$ is now a known constant. It is easily seen that $s_0^2$ has
the mean $\sigma^2$ and the variance $2\sigma^4/n$, and thus provides an efficient estimate of $\sigma^2$.


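Ex. 3. The s. d. of a normal population. If, in the distribution of Ex. 2, we
regard the s. d. $\sigma$ instead of the variance $\sigma^2$ as the parameter to be estimated, we find

$$E\left(\frac{\partial \log f}{\partial \sigma}\right)^2 = \frac{2}{\sigma^2} .$$

Consequently the variance of any regular unbiased estimate of $\sigma$ is at least equal to
$\sigma^2/(2n)$. Consider e. g. the expression

$$s' = \sqrt{\frac{n}{2}}\;\frac{\Gamma\!\left(\frac{n-1}{2}\right)}{\Gamma\!\left(\frac{n}{2}\right)}\;s,$$

where $s$ is the s. d. of the sample. By (29.3.3) we have $E(s') = \sigma$, and

$$D^2(s') = \left(\frac{n-1}{2}\,\frac{\Gamma^2\!\left(\frac{n-1}{2}\right)}{\Gamma^2\!\left(\frac{n}{2}\right)} - 1\right)\sigma^2 = \frac{\sigma^2}{2n} + O\!\left(\frac{1}{n^2}\right),$$

so that the efficiency $e(s')$ tends to 1 as $n \to \infty$. For small $n$ the efficiency is,
however, considerably smaller than 1. Taking e. g. $n = 2$, we have
$e(s') = \frac{1}{2(\pi-2)} = 0.4380$, while for $n = 3$ we have $e(s') = \frac{\pi}{6(4-\pi)} = 0.6100$.

Similarly we find that the expression

$$s_0' = \sqrt{\frac{n}{2}}\;\frac{\Gamma\!\left(\frac{n}{2}\right)}{\Gamma\!\left(\frac{n+1}{2}\right)}\;s_0,$$

where $s_0$ is defined in Ex. 2, is an unbiased estimate of $\sigma$, with variance

$$D^2(s_0') = \left(\frac{n}{2}\,\frac{\Gamma^2\!\left(\frac{n}{2}\right)}{\Gamma^2\!\left(\frac{n+1}{2}\right)} - 1\right)\sigma^2 = \frac{\sigma^2}{2n} + O\!\left(\frac{1}{n^2}\right).$$

The efficiency $e(s_0')$ tends to 1 as $n \to \infty$. For $n = 2$ we have $e(s_0') = \frac{\pi}{4(4-\pi)} \approx 0.915$,
while for $n = 3$ we have $e(s_0') = \frac{4}{3(3\pi-8)} = 0.9358$, considerably above the
corresponding figures for $s'$.

For the mean deviation $s_1 = \frac{1}{n}\sum |x_i - m|$, we find by easy calculations

$$E(s_1) = \sigma\sqrt{\frac{2}{\pi}}, \qquad D^2(s_1) = \left(1-\frac{2}{\pi}\right)\frac{\sigma^2}{n},$$

so that $\sqrt{\pi/2}\;s_1$ is an unbiased estimate of $\sigma$, with the efficiency $\frac{1}{\pi-2} = 0.8760$.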
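The numerical efficiencies quoted in this example can be recomputed from the Gamma function. The following Python sketch uses function names of our own; it relies on `math.gamma`, which overflows for arguments above roughly 170, so it is intended for small $n$ only.

```python
from math import gamma, pi

def efficiency_from_sample_sd(n):
    """Efficiency of the unbiased estimate built from the sample s.d. s (n >= 2)."""
    # D^2 / sigma^2 = (n - 1)/2 * Gamma((n - 1)/2)^2 / Gamma(n/2)^2 - 1
    relative_variance = (n - 1) / 2 * (gamma((n - 1) / 2) / gamma(n / 2)) ** 2 - 1
    return 1 / (2 * n * relative_variance)

def efficiency_from_known_mean_sd(n):
    """Efficiency of the unbiased estimate built from s_0 (mean m known, n >= 1)."""
    # D^2 / sigma^2 = n/2 * Gamma(n/2)^2 / Gamma((n + 1)/2)^2 - 1
    relative_variance = n / 2 * (gamma(n / 2) / gamma((n + 1) / 2)) ** 2 - 1
    return 1 / (2 * n * relative_variance)

def efficiency_from_mean_deviation():
    """Efficiency of sqrt(pi/2) * s_1, which does not depend on n."""
    return 1 / (pi - 2)
```

In each case the efficiency is the ratio of the lower limit $\sigma^2/(2n)$ to the variance of the estimate, so that $\sigma$ cancels and only $n$ enters.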

Ex. 4. A non-regular case. When the fr. f. has discontinuity points, the position
of which depends on the parameter, the conditions for a regular case are usually
not satisfied. In such cases, it is often possible to find unbiased estimates of
»abnormally high» precision, i. e. such that the variance is smaller than the lower limit
given by (32.3.3 a) for regular estimates.

Consider e. g. the fr. f. defined by $f(x;\alpha) = e^{\alpha-x}$ for $x \ge \alpha$, and $f(x;\alpha) = 0$ for
$x < \alpha$. In the point $x = \alpha$ the derivative $\partial f/\partial\alpha$ does not exist, so that this is a
non-regular case. As we have seen in 7.3, the relation $\int f\,dx = 1$ cannot in this case
be differentiated in the usual simple way; we have, in fact, $\int \frac{\partial \log f}{\partial\alpha}\,f\,dx = 1$. When
we pass from (32.3.5) to (32.3.6), all the $n^2$ terms in the first member will thus be
equal to 1. Assuming that the functions $g$ and $h$ satisfy our conditions, we then
obtain instead of $D^2(\alpha^*) \ge 1/n$, which would follow from (32.3.3 a), only the weaker
inequality $D^2(\alpha^*) \ge 1/n^2$.

For the particular estimate $\alpha^* = \operatorname{Min} x_i - 1/n$, where $\operatorname{Min} x_i$ denotes the smallest
of the sample values, we find the fr. f. $n f(n\alpha^*;\, n\alpha - 1)$, so that $E(\alpha^*) = \alpha$,
$D^2(\alpha^*) = 1/n^2$. Thus $\alpha^*$ is an unbiased estimate, the variance of which is for all
$n > 1$ smaller than the limit given by (32.3.3 a).

A further example of the same character is provided by the rectangular distribution,
when we use the mean or the difference of the extreme values of the sample
as estimates of the mean or the range of the population. According to (28.6.8) and
(28.6.9), the variance is in both cases of the order $n^{-2}$, and thus certainly falls below
the limit given by (32.3.3 a), when $n$ is large.
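The »abnormally high» precision of the estimate $\operatorname{Min} x_i - 1/n$ can be observed in a small simulation. The Python sketch below is an illustration under our own assumptions (function name, number of trials, seed); it draws samples from the fr. f. $e^{\alpha-x}$, $x \ge \alpha$, by adding a standard exponential variable to $\alpha$.

```python
import random

def simulate_shifted_minimum(alpha, n, trials=100_000, seed=2024):
    """Return the simulated mean and variance of Min x_i - 1/n."""
    rng = random.Random(seed)
    total = total_sq = 0.0
    for _ in range(trials):
        # x = alpha + (standard exponential) has the fr. f. exp(alpha - x) for x >= alpha
        smallest = min(alpha + rng.expovariate(1.0) for _ in range(n))
        estimate = smallest - 1.0 / n
        total += estimate
        total_sq += estimate ** 2
    mean = total / trials
    return mean, total_sq / trials - mean ** 2
```

For $n = 5$, say, the simulated variance should lie near $1/n^2 = 0.04$, well below the value $1/n = 0.2$ that (32.3.3 a) would give in a regular case.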



2. The discrete type. — Consider a discrete distribution with the mass
points $u_1, u_2, \ldots$ and the corresponding probabilities $p_1(\alpha), p_2(\alpha), \ldots$,
where $\alpha$ may have any value in $A$, and the $u_i$ are independent of $\alpha$.
This case is largely analogous to the previous case, and will be treated
somewhat briefly. As in the previous case, we consider an estimate
$\alpha^* = \alpha^*(x_1, \ldots, x_n)$ with the mean $E(\alpha^*) = \alpha + b(\alpha)$.

The probability that the sample point in $R_n$ with the coordinates
$x_1, \ldots, x_n$ assumes the particular position $M$ determined by $x_1 = u_{i_1}$,
$\ldots$, $x_n = u_{i_n}$ is equal to $p_{i_1}(\alpha) \ldots p_{i_n}(\alpha)$. The point $M$ may, however,
also be determined by another set of $n$ coordinates, viz. by the value
assumed by $\alpha^*$ in $M$, say $\alpha_\nu^*$, and by $n-1$ further coordinates
$\nu_1, \ldots, \nu_{n-1}$ which determine the position of $M$ on the hypersurface
$\alpha^* = \alpha_\nu^*$. If $q_\nu(\alpha)$ denotes the probability that $\alpha^*$ takes the value $\alpha_\nu^*$,
while $r_{\nu_1 \ldots \nu_{n-1} \mid \nu}(\alpha)$ is the conditional probability of the set of values
of $\nu_1, \ldots, \nu_{n-1}$ corresponding to $M$, for a given $\nu$, we have the following
relation which corresponds to (32.3.2):

(32.3.8)    $$p_{i_1}(\alpha) \ldots p_{i_n}(\alpha) = q_\nu(\alpha)\,r_{\nu_1 \ldots \nu_{n-1} \mid \nu}(\alpha) .$$

We now define a regular estimation case of the discrete type by the
condition that, for every $\alpha$ in $A$, all derivatives $p_i'(\alpha)$, $q_\nu'(\alpha)$ and
$r'_{\nu_1 \ldots \nu_{n-1} \mid \nu}(\alpha)$ exist and are such that the series $\sum_i p_i'(\alpha)$ etc., which
correspond to the analogous integrals considered in the continuous
case, converge absolutely and uniformly in $A$. We shall then also call
$\alpha^*$ a regular estimate of $\alpha$.

In any regular estimation case of the discrete type, we have the
inequality corresponding to (32.3.3):

(32.3.9)    $$E(\alpha^*-\alpha)^2 \;\ge\; \frac{\left(1+\dfrac{db}{d\alpha}\right)^2}{n\displaystyle\sum_i \left(\frac{d \log p_i}{d\alpha}\right)^2 p_i} .$$

The sign of equality holds here, for every $\alpha$ in $A$, when and only when
the following two conditions are satisfied whenever $q_\nu(\alpha) > 0$:

A) The conditional probability $r_{\nu_1 \ldots \nu_{n-1} \mid \nu}(\alpha)$ is independent of $\alpha$.

B) We have $\dfrac{d \log q_\nu}{d\alpha} = k\,(\alpha_\nu^*-\alpha)$, where $k$ is independent of $\nu$ but
may depend on $\alpha$.



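In the particular case when $\alpha^*$ is unbiased whatever be the value of
$\alpha$ in $A$, we have $b(\alpha) = 0$, and (32.3.9) reduces to

(32.3.9 a)    $$D^2(\alpha^*) \;\ge\; \frac{1}{n\displaystyle\sum_i \left(\frac{d \log p_i}{d\alpha}\right)^2 p_i} .$$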

The proof of this theorem follows the same lines as the corre¬ 
sponding proof in the continuous case. We take the logarithmic 
derivatives on both sides of (32.3.8), square, multiply by (32.3.8), and 
then sum over all possible sample points M. By means of Lemma 2 
of the preceding paragraph, the truth of the theorem then follows. 

As in the continuous case, an unbiased estimate will be called
efficient, when the sign of equality holds in (32.3.9 a). The definition 
of the efficiency of an estimate, and the remarks concerning the cor¬ 
relation between various estimates, extend themselves with obvious 
modifications to the discrete case. 

The expressions (32.3.3 a) and (32.3.9 a) are particular cases of the general inequality

$$D^2(\alpha^*) \;\ge\; \frac{1}{n\displaystyle\int_{-\infty}^{\infty} \frac{\left(d_x\,\dfrac{\partial F}{\partial\alpha}\right)^2}{d_x F}},$$

which holds, under certain conditions, even for a d. f. $F(x;\alpha)$ not belonging to one
of the two simple types. The integral appearing here is of a type known as Hellinger's
integral (cf e. g. Hobson, Ref. 17, I, p. 609). We shall not go into this matter
here, but proceed to give some further examples of efficient estimates.

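Ex. 5. For the binomial distribution we have $u_i = i$, $p_i = \binom{N}{i} p^i q^{N-i}$, where $\alpha = p$
is the parameter to be estimated, while $N$ is a known integer, and $q = 1 - p$. Then

$$\sum_{i=0}^{N}\left(\frac{d \log p_i}{dp}\right)^2 p_i = \sum_{i=0}^{N}\left(\frac{i-Np}{pq}\right)^2 p_i = \frac{N}{pq} .$$

Thus the variance of any regular unbiased estimate $p^*$ from a sample of $n$ values is
at least equal to $\frac{pq}{nN}$. For the particular estimate $p^* = \bar{x}/N = \frac{1}{nN}\sum x_i$ we find
$E(p^*) = p$ and $D^2(p^*) = \frac{pq}{nN}$, so that this is an efficient estimate.

Ex. 6. For the Poisson distribution with the parameter $\lambda$ we have $p_i = \frac{\lambda^i}{i!}\,e^{-\lambda}$,
and

$$\sum_{i=0}^{\infty}\left(\frac{d \log p_i}{d\lambda}\right)^2 p_i = \sum_{i=0}^{\infty}\left(\frac{i}{\lambda}-1\right)^2 p_i = \frac{1}{\lambda} .$$

Thus the variance of any regular unbiased estimate is at least equal to $\lambda/n$. For the
particular estimate $\lambda^* = \bar{x} = \frac{1}{n}\sum x_i$ we have $E(\lambda^*) = \lambda$ and $D^2(\lambda^*) = \lambda/n$, so that
this is an efficient estimate.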
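The two sums occurring in Ex. 5 and Ex. 6 may be evaluated directly, which gives a quick check of the lower limits $pq/(nN)$ and $\lambda/n$. The Python sketch below uses names of our own, and truncates the Poisson series after a finite number of terms, which is an approximation.

```python
from math import comb, exp

def binomial_sum(N, p):
    """Sum over i of (d log p_i / dp)^2 * p_i for the binomial distribution."""
    q = 1 - p
    return sum(((i - N * p) / (p * q)) ** 2 * comb(N, i) * p ** i * q ** (N - i)
               for i in range(N + 1))

def poisson_sum(lam, terms=200):
    """Sum over i of (d log p_i / d lambda)^2 * p_i, truncated after `terms` terms."""
    total, prob = 0.0, exp(-lam)
    for i in range(terms):
        total += (i / lam - 1) ** 2 * prob
        prob *= lam / (i + 1)  # p_{i+1} = p_i * lambda / (i + 1)
    return total

def variance_limit(information_sum, n):
    """Second member of (32.3.9 a) for a sample of n values."""
    return 1.0 / (n * information_sum)
```

For instance, `variance_limit(binomial_sum(8, 0.3), 10)` should agree with $pq/(nN) = 0.21/80$ up to rounding.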




32.4. Sufficient estimates. — In order that a regular unbiased
estimate $\alpha^*$ should be efficient, i. e. of minimum variance, it is necessary
and sufficient that the conditions A) and B) of the preceding paragraph
are both satisfied. If we only require that condition A) should
be satisfied, we obtain a wider class of estimates. We now proceed
to consider this class, restricting ourselves to distributions of the
continuous type, the discrete case being perfectly analogous.

For the continuous case, condition A) requires that the conditional
fr. f. $h(\xi_1, \ldots, \xi_{n-1} \mid \alpha^*;\alpha)$ should be independent of $\alpha$, whenever
$g(\alpha^*;\alpha) > 0$. This means that the distribution of mass in the infinitesimal
domain bounded by two adjacent hypersurfaces $\alpha^*$ and $\alpha^* + d\alpha^*$
is independent of $\alpha$. In such a case, the estimate $\alpha^*$ may be said to
summarize all the relevant information contained in the sample with
respect to the parameter $\alpha$. In fact, when we know the value of $\alpha^*$
corresponding to our sample, say $\alpha_0^*$, the sample point $M$ must lie on
the hypersurface $\alpha^* = \alpha_0^*$, and the conditional distribution on this
hypersurface is independent of $\alpha$, so that the further specification of
the position of $M$ does not give any new information with respect
to $\alpha$. Using the terminology introduced by R. A. Fisher (Ref. 89, 96),
we shall then call $\alpha^*$ a sufficient estimate. Since in (32.3.1) the Jacobian
$J$ is independent of $\alpha$, it follows that $\alpha^*$ is sufficient if and
only if

(32.4.1)    $$f(x_1;\alpha) \ldots f(x_n;\alpha) = g(\alpha^*;\alpha)\,H(x_1, \ldots, x_n),$$

where $H$ is independent of $\alpha$.
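One consequence of (32.4.1) is easily observed numerically: since $H$ does not involve $\alpha$, the ratio of the likelihood function at two parameter values depends on the sample only through $\alpha^*$. The Python sketch below illustrates this for the mean of a normal population with known s. d.; the function names and the sample values are our own and purely illustrative.

```python
from math import exp, pi, sqrt

def likelihood(sample, m, sigma=1.0):
    """L = f(x_1; m) ... f(x_n; m) for a normal population with known s.d. sigma."""
    value = 1.0
    for x in sample:
        value *= exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))
    return value

def likelihood_ratio(sample, m_a, m_b, sigma=1.0):
    """Ratio L(sample; m_a) / L(sample; m_b), in which the factor H cancels."""
    return likelihood(sample, m_a, sigma) / likelihood(sample, m_b, sigma)
```

The samples `[1.0, 2.0, 3.0]` and `[0.0, 2.0, 4.0]` have the same mean, and accordingly give the same value of `likelihood_ratio` for any pair of parameter values, while a sample with a different mean in general gives a different ratio.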

From the nature of the conditions A) and B), it is fairly evident 
that efficient or sufficient estimates can only be expected to exist for 
rather special classes of populations. There are important connections 
between these classes of estimates, when they exist, and the maximum 
likelihood method (cf 33.2). 

For further information concerning the conditions of existence
and other properties of efficient and sufficient estimates, the reader
is referred to papers by R. A. Fisher (Ref. 89, 96, 103, 104 etc.),
Neyman (Ref. 162), Neyman and E. S. Pearson (Ref. 173), Koopman
(Ref. 141), Darmois (Ref. 74), Dugué (Ref. 76) and others.

In Ex. 1, 2, 5 and 6 of the preceding paragraph, we have considered various
examples of efficient estimates. All these are, a fortiori, sufficient estimates. In
each case, this can be directly shown by studying the transformation which replaces
the original sample variables by the estimate $\alpha^*$ and $n-1$ further conveniently
chosen new variables, and verifying that condition A) is satisfied. The reader is
recommended to carry out these transformations in detail. (Cf also the analogous
case in 32.6, Ex. 1.)

The estimate $s_0'$ defined in 32.3, Ex. 3, is an example of a regular unbiased
estimate satisfying condition A) but not condition B), i. e. a sufficient estimate which
is not efficient. A further example of the same kind will be given in 33.3, Ex. 3.
Thus the class of sufficient estimates is effectively more general than the class of
efficient estimates.

The above definition of a sufficient estimate, which applies to the class of
regular and unbiased estimates, may be directly extended to the class of all regular
estimates, whether unbiased or not. After this extension, it follows immediately
from the definition that the property of sufficiency is invariant under a change of
variable in the parameter. Thus if $\alpha^*$ is a sufficient estimate of the parameter $\alpha$,
and if we replace $\alpha$ by a new parameter $\varphi(\alpha)$, then $\varphi(\alpha^*)$ will be a sufficient
estimate of $\varphi(\alpha)$. For efficient estimates, there is no corresponding proposition.

32.5. Asymptotically efficient estimates. — In the preceding paragraphs,
we have considered the size n of the sample as a fixed integer
$\geq 1$. Let us now suppose that the regular unbiased estimate
$\alpha^* = \alpha^*(x_1,\dots,x_n)$ is defined for all sufficiently large values of n,
and let us consider the asymptotic behaviour of $\alpha^*$ as n tends to
infinity.

If $\alpha^*$ converges in probability to $\alpha$ as n tends to infinity, $\alpha^*$ is a
consistent estimate of $\alpha$ (cf 27.6). In Chs 27-29, we have seen (cf
e. g. 27.7 and 28.4) that in many important cases the s. d. of an estimate
$\alpha^*$ is of order $n^{-1/2}$ for large n, so that we have $D(\alpha^*) \sim c\,n^{-1/2}$,
where c is a constant. If $\alpha^*$ is unbiased and has a s. d. of this form,
it is obvious that $\alpha^*$ is consistent (cf 20.4). Further, in such a case
the efficiency $e(\alpha^*)$ defined by (32.3.7) tends to a definite limit as n
tends to infinity:

(32.5.1)  $$\lim_{n\to\infty} e(\alpha^*) = e_0(\alpha^*) = \frac{1}{c^2 \int_{-\infty}^{\infty} \left(\frac{\partial \log f}{\partial \alpha}\right)^2 f\,dx}.$$

In the discrete case we obtain an analogous expression. This limit is
called the asymptotic efficiency of $\alpha^*$. Obviously $0 \leq e_0(\alpha^*) \leq 1$.

Consider further the important case of an estimate $\alpha^*$, whether
regular and unbiased or not, which for large n is asymptotically normal
$(\alpha, c/\sqrt{n})$. We have seen in 28.4 that this situation may arise
even in cases when $E(\alpha^*)$ and $D(\alpha^*)$ do not exist. However, when n
is large, the distribution of $\alpha^*$ will then for practical purposes be
equivalent to a normal distribution with the mean $\alpha$ and the s. d.
$c/\sqrt{n}$, and accordingly we shall even in such cases denote the quantity
$e_0(\alpha^*)$ defined by the last member of (32.5.1) as the asymptotic
efficiency of $\alpha^*$.

When $e_0(\alpha^*) = 1$, we shall call $\alpha^*$ an asymptotically efficient
estimate of $\alpha$. Under fairly general conditions, an asymptotically efficient
estimate can be found by the method of maximum likelihood (cf 33.3).

Ex. 1. For the normal distribution, the sample median may be used as an
estimate of m, and by 28.6 this estimate has the asymptotic efficiency $2/\pi = 0.6366$.
Thus if we estimate m by calculating the median from a sample of, say, n = 10 000
observations, we obtain an estimate of the same precision as could be obtained by
calculating the mean $\bar{x}$ from a sample of only $2n/\pi = 6366$ observations.
Nevertheless, the median is sometimes preferable in practice, on account of the greater
simplicity of its calculation.
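The figures quoted in this example can be checked numerically. The following is a minimal sketch, not part of the original text: the sample size 101, the number of repetitions and the seed are arbitrary assumptions, and the simulated ratio only approximates the limiting value $2/\pi$.

```python
import math
import random
import statistics

# Asymptotic efficiency of the median in normal samples, and the size of a
# sample whose mean is as precise as the median of 10 000 observations.
asymptotic = 2 / math.pi
equivalent_n = round(2 * 10_000 / math.pi)


def simulate_relative_efficiency(n=101, reps=4000, seed=1):
    """Monte Carlo estimate of Var(mean) / Var(median) for normal samples."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return statistics.pvariance(means) / statistics.pvariance(medians)


ratio = simulate_relative_efficiency()
print(f"2/pi = {asymptotic:.4f}, equivalent sample size = {equivalent_n}")
print(f"simulated variance ratio for n = 101: {ratio:.3f}")
```

For moderate n the simulated ratio lies somewhat above $2/\pi$, since the limiting value is only reached as n tends to infinity.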

We may also use the arithmetic mean of the $\nu$-th values from the top and from
the bottom of the sample as an estimate of m. By (28.6.17), this is an estimate of
asymptotic efficiency zero.

When, in the normal distribution, m is known, and it is required to estimate the
variance $\sigma^2$ or the s. d. $\sigma$, we may use various estimates connected with the sample
variance $s^2$. In Ex. 2-3 of 32.3, we have already met with some examples of
asymptotically efficient estimates of this kind. We may also use the difference
between the $\nu$-th values from the top and from the bottom of the sample, multiplied
by an appropriate constant, as an estimate of $\sigma$. According to (28.6.18), this is an
estimate of asymptotic efficiency zero. The use of this estimate in large samples
would thus involve a »loss of information» even greater than in the case of the
sample median mentioned above. Nevertheless, the estimates of $\sigma$ as well as of m
based on the $\nu$-th values may often be used in practice with great advantage, as
their calculation is very simple, and the loss of information is not considerable for
small values of n (cf the papers quoted in this connection in 28.6).


Ex. 2. For the Cauchy distribution with the fr. f. $f(x;\mu) = \frac{1}{\pi}\,[1 + (x-\mu)^2]^{-1}$
we have

$$\int_{-\infty}^{\infty}\left(\frac{\partial \log f}{\partial \mu}\right)^2 f\,dx = \frac{4}{\pi}\int_{-\infty}^{\infty}\frac{(x-\mu)^2}{[1+(x-\mu)^2]^3}\,dx = \frac{1}{2}.$$

Thus the variance of any regular unbiased estimate of $\mu$ is at least equal to 2/n.
By 19.2, the sample mean $\bar{x}$ has the same fr. f. $f(x;\mu)$, so that the mean is not a
consistent estimate of $\mu$. Neither is the arithmetic mean of the $\nu$-th values from the
top and from the bottom of the sample (cf 28.6.11). On the other hand, the sample
median is by 28.5 asymptotically normal $\left(\mu, \frac{\pi}{2\sqrt{n}}\right)$, and thus the median has the
asymptotic efficiency

$$\frac{2}{n}\cdot\frac{4n}{\pi^2} = \frac{8}{\pi^2} = 0.8106.$$
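The two numerical values of this example can be verified by a short computation. The sketch below is an illustration only; the substitution $x - \mu = \tan\theta$ and the midpoint rule are choices made here, not taken from the text.

```python
import math


def cauchy_information(steps=10_000):
    """Midpoint-rule value of the integral of (d log f / d mu)^2 f dx
    for the Cauchy fr. f. (1/pi) / (1 + (x - mu)^2).

    With x - mu = tan(theta) the integrand (4/pi) t^2 / (1 + t^2)^3 dt
    becomes (4/pi) sin^2(theta) cos^2(theta) d(theta) on (-pi/2, pi/2).
    """
    h = math.pi / steps
    total = 0.0
    for i in range(steps):
        theta = -math.pi / 2 + (i + 0.5) * h
        total += math.sin(theta) ** 2 * math.cos(theta) ** 2
    return 4.0 / math.pi * total * h


info = cauchy_information()               # close to 1/2
bound_times_n = 1.0 / info                # n times the lower bound of the variance
median_var_times_n = math.pi ** 2 / 4.0   # n times the asymptotic variance of the median
efficiency = bound_times_n / median_var_times_n
print(f"information = {info:.6f}, efficiency of the median = {efficiency:.4f}")
```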


32.6. The case of two unknown parameters. — We shall now
briefly indicate how the concepts and propositions given in the preceding
paragraphs may be generalized to cases involving several unknown
parameters. It will be sufficient to give the explicit statements
of the results for continuous distributions, as the corresponding results
for the discrete case follow by analogy. In order to simplify the
writing, we shall further restrict ourselves to the case of unbiased
estimates.

In the present paragraph we shall consider a distribution with two
unknown parameters $\alpha$ and $\beta$, specified by a fr. f. $f(x;\alpha,\beta)$. From a
sample of n values $x_1,\dots,x_n$ drawn from this distribution, we form
two functions $\alpha^* = \alpha^*(x_1,\dots,x_n)$ and $\beta^* = \beta^*(x_1,\dots,x_n)$, which are
assumed to be unbiased estimates of $\alpha$ and $\beta$ respectively. We then
consider a transformation in the sample space $R_n$, replacing the old
variables $x_1,\dots,x_n$ by n new variables $\alpha^*$, $\beta^*$ and $\xi_1,\dots,\xi_{n-2}$. For
this transformation we have the following relations corresponding to
(32.3.1) and (32.3.2):

$$|J|\prod_{i=1}^{n} f(x_i;\alpha,\beta) = g(\alpha^*,\beta^*;\alpha,\beta)\,h(\xi_1,\dots,\xi_{n-2}\mid\alpha^*,\beta^*;\alpha,\beta),$$

$$\prod_{i=1}^{n} f(x_i;\alpha,\beta)\,dx_i = g(\alpha^*,\beta^*;\alpha,\beta)\,h(\xi_1,\dots,\xi_{n-2}\mid\alpha^*,\beta^*;\alpha,\beta)\,d\alpha^*\,d\beta^*\,d\xi_1\cdots d\xi_{n-2}.$$

Here g is the joint fr. f. of $\alpha^*$ and $\beta^*$, while h is the conditional
fr. f. of $\xi_1,\dots,\xi_{n-2}$ for given values of $\alpha^*$ and $\beta^*$. Finally J is a
Jacobian independent of $\alpha$ and $\beta$.

A regular estimation case is now defined as a case where the fr. f.s
f, g and h satisfy the regularity conditions stated in 32.3 with respect
to both parameters $\alpha$ and $\beta$.

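Operating in the same way as in 32.3, though dealing with total
differentials with respect to $\alpha$ and $\beta$ instead of partial derivatives
with respect to $\alpha$, we obtain (cf Dugué, Ref. 76)

(32.6.1)  $$n\int_{-\infty}^{\infty}\left(\frac{\partial\log f}{\partial\alpha}\,d\alpha + \frac{\partial\log f}{\partial\beta}\,d\beta\right)^2 f\,dx \;\geq\; \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\left(\frac{\partial\log g}{\partial\alpha}\,d\alpha + \frac{\partial\log g}{\partial\beta}\,d\beta\right)^2 g\,d\alpha^*\,d\beta^*,$$

where the sign of equality holds when and only when the conditional
fr. f. h is independent of $\alpha$ and $\beta$, whenever g > 0. In a case where
this condition is satisfied, the estimates $\alpha^*$ and $\beta^*$ may be said to
summarize all relevant information contained in the sample with respect
to $\alpha$ and $\beta$. In generalization of 32.4, we shall then say that
$\alpha^*$ and $\beta^*$ are joint sufficient estimates of $\alpha$ and $\beta$.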

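Both members of (32.6.1) are quadratic forms in $d\alpha$ and $d\beta$. Owing
to the homogeneity, the same inequality between the forms holds true
even if $d\alpha$ and $d\beta$ are replaced by any variables u and v, and thus
(32.6.1) may be written

(32.6.2)  $$n\left[E\left(\frac{\partial\log f}{\partial\alpha}\right)^2 u^2 + 2E\left(\frac{\partial\log f}{\partial\alpha}\frac{\partial\log f}{\partial\beta}\right)uv + E\left(\frac{\partial\log f}{\partial\beta}\right)^2 v^2\right] \geq E\left(\frac{\partial\log g}{\partial\alpha}\right)^2 u^2 + 2E\left(\frac{\partial\log g}{\partial\alpha}\frac{\partial\log g}{\partial\beta}\right)uv + E\left(\frac{\partial\log g}{\partial\beta}\right)^2 v^2.$$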


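Consider now the inequality (32.2.1), which expresses the main
result of Lemma 1 in 32.2, and suppose that $\psi(\alpha) = \alpha$. The inequality
(32.2.1) may then be written as an inequality between two quadratic
forms in one variable:

$$E\left(\frac{\partial\log g}{\partial\alpha}\right)^2 u^2 \;\geq\; \frac{u^2}{E(\alpha^*-\alpha)^2},$$

where $g = g(\alpha^*;\alpha)$ is a fr. f. with the mean $E(\alpha^*) = \alpha$, and the form
in the second member is the reciprocal of the form $E(\alpha^*-\alpha)^2\,u^2$.
When expressed in this way, the lemma may be generalized to fr. f.s
involving several parameters (cf Cramér, Ref. 72; the detailed proof
of this generalization will not be given here). In the case of two
parameters, the generalized lemma asserts that the second member of
(32.6.2) is at least equal to the reciprocal form of

$$E(\alpha^*-\alpha)^2\,u^2 + 2E[(\alpha^*-\alpha)(\beta^*-\beta)]\,uv + E(\beta^*-\beta)^2\,v^2 = \sigma_1^2u^2 + 2\rho\sigma_1\sigma_2uv + \sigma_2^2v^2,$$

where $\sigma_1$, $\sigma_2$ and $\rho$ denote the s. d.s and the correlation coefficient
of $\alpha^*$ and $\beta^*$, so that

(32.6.3)  $$E\left(\frac{\partial\log g}{\partial\alpha}\right)^2 u^2 + 2E\left(\frac{\partial\log g}{\partial\alpha}\frac{\partial\log g}{\partial\beta}\right)uv + E\left(\frac{\partial\log g}{\partial\beta}\right)^2 v^2 \;\geq\; \frac{1}{1-\rho^2}\left(\frac{u^2}{\sigma_1^2} - \frac{2\rho uv}{\sigma_1\sigma_2} + \frac{v^2}{\sigma_2^2}\right).$$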

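Now the concentration ellipse of the joint distribution of $\alpha^*$ and $\beta^*$
has the equation (cf 21.10.1)

(32.6.4)  $$\frac{1}{1-\rho^2}\left(\frac{(u-\alpha)^2}{\sigma_1^2} - \frac{2\rho(u-\alpha)(v-\beta)}{\sigma_1\sigma_2} + \frac{(v-\beta)^2}{\sigma_2^2}\right) = 4.$$

The inequalities (32.6.2) and (32.6.3) thus imply that the fixed ellipse

(32.6.5)  $$n\left[E\left(\frac{\partial\log f}{\partial\alpha}\right)^2(u-\alpha)^2 + 2E\left(\frac{\partial\log f}{\partial\alpha}\frac{\partial\log f}{\partial\beta}\right)(u-\alpha)(v-\beta) + E\left(\frac{\partial\log f}{\partial\beta}\right)^2(v-\beta)^2\right] = 4$$

lies wholly within the concentration ellipse of any pair of regular unbiased
estimates $\alpha^*$, $\beta^*$. This is the generalization to two parameters
of the inequality (32.3.3 a).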

When the sign of equality holds in both relations (32.6.2) and
(32.6.3), we shall say that $\alpha^*$ and $\beta^*$ are joint efficient estimates of $\alpha$
and $\beta$. In this case the two ellipses (32.6.4) and (32.6.5) coincide, and
the joint distribution of $\alpha^*$ and $\beta^*$ has a greater concentration (cf 21.10)
than the distribution of any non-efficient pair of estimates.

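Consider now a pair of joint efficient estimates $\alpha_0^*$ and $\beta_0^*$. The
variances of $\alpha_0^*$ and $\beta_0^*$, and the correlation coefficient between these
two estimates, are obtained by forming the reciprocal of the quadratic
form in the first member of (32.6.5):

$$D^2(\alpha_0^*) = \frac{1}{n\Delta}\,E\left(\frac{\partial\log f}{\partial\beta}\right)^2, \qquad D^2(\beta_0^*) = \frac{1}{n\Delta}\,E\left(\frac{\partial\log f}{\partial\alpha}\right)^2,$$

$$\rho(\alpha_0^*,\beta_0^*) = -\,\frac{E\left(\frac{\partial\log f}{\partial\alpha}\frac{\partial\log f}{\partial\beta}\right)}{\sqrt{E\left(\frac{\partial\log f}{\partial\alpha}\right)^2 E\left(\frac{\partial\log f}{\partial\beta}\right)^2}},$$

where

$$\Delta = E\left(\frac{\partial\log f}{\partial\alpha}\right)^2 E\left(\frac{\partial\log f}{\partial\beta}\right)^2 - E^2\left(\frac{\partial\log f}{\partial\alpha}\frac{\partial\log f}{\partial\beta}\right).$$

Hence we obtain e. g.

$$D^2(\alpha_0^*) = \frac{1}{1-\rho^2(\alpha_0^*,\beta_0^*)}\cdot\frac{1}{n\,E\left(\frac{\partial\log f}{\partial\alpha}\right)^2}.$$

As soon as $E\left(\frac{\partial\log f}{\partial\alpha}\frac{\partial\log f}{\partial\beta}\right) \neq 0$, the variance of $\alpha_0^*$ is thus greater
than the variance of an efficient estimate in the case when $\alpha$ is the
only unknown parameter (cf 32.3.3 a). Now, in a case when there are
two unknown parameters it often happens that we are only interested
in estimating one of the parameters, say $\alpha$, and we may then ask if
it would be possible to find some other pair of regular unbiased estimates
$\alpha^*$, $\beta^*$ yielding a variance $D^2(\alpha^*) < D^2(\alpha_0^*)$, no matter how large
the corresponding $D^2(\beta^*)$ becomes.

However, since the ellipse (32.6.5) lies wholly within the ellipse
(32.6.4), the maximum value of the abscissa for all points of the
former ellipse is at most equal to the corresponding maximum for
the latter ellipse. Hence we obtain by some calculation the inequality

$$D^2(\alpha^*) \geq D^2(\alpha_0^*),$$

which shows that it is not possible to find a »better» estimate of $\alpha$
than $\alpha_0^*$.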
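The step of forming the reciprocal of the quadratic form in the first member of (32.6.5) amounts to inverting a 2 x 2 matrix. The sketch below illustrates this; the names a, b, c for the three expectations and the numerical values are assumptions introduced here for illustration, not notation of the text.

```python
import math


def efficient_moments(n, a, b, c):
    """Variances and correlation of a pair of joint efficient estimates.

    a = E[(d log f / d alpha)^2], b = E[(d log f / d beta)^2],
    c = E[(d log f / d alpha)(d log f / d beta)].  The moment matrix is
    taken as the inverse of n * [[a, c], [c, b]].
    """
    delta = a * b - c * c
    if delta <= 0:
        raise ValueError("the quadratic form must be positive definite")
    var_alpha = b / (n * delta)
    var_beta = a / (n * delta)
    rho = -c / math.sqrt(a * b)
    return var_alpha, var_beta, rho


n, a, b, c = 50, 2.0, 1.0, 0.6          # illustrative numbers only
var_alpha, var_beta, rho = efficient_moments(n, a, b, c)
single_parameter_bound = 1.0 / (n * a)  # bound when alpha is the only unknown
print(var_alpha, single_parameter_bound / (1.0 - rho ** 2))
```

The two printed numbers agree, and both exceed the single-parameter bound whenever c differs from zero.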

The ratio between the two-dimensional variance (cf 22.7) of a pair
of joint efficient estimates $\alpha_0^*$, $\beta_0^*$ and the corresponding quantity for
any pair of regular unbiased estimates $\alpha^*$, $\beta^*$ will be called the joint
efficiency of $\alpha^*$ and $\beta^*$, and denoted by $e(\alpha^*,\beta^*)$. This is identical
with the square of the ratio between the areas of the ellipses (32.6.5)
and (32.6.4), which by (11.12.3) is

$$e(\alpha^*,\beta^*) = \frac{1}{n^2\,\Delta\,\sigma_1^2\sigma_2^2(1-\rho^2)}.$$

The concepts of asymptotic efficiency and asymptotically efficient estimate
(cf 32.5) directly extend themselves to the present case.

As in 32.3, all the above results remain true in the case when we
consider samples from a multidimensional population, specified by a
fr. f. $f(x_1,\dots,x_k;\alpha,\beta)$ containing two unknown parameters.


Ex. 1. When both parameters $\alpha = m$ and $\beta = \sigma^2$ of a normal distribution are
unknown, we have (cf 32.3, Ex. 1-2)

$$E\left(\frac{\partial\log f}{\partial m}\right)^2 = \frac{1}{\sigma^2}, \qquad E\left(\frac{\partial\log f}{\partial m}\frac{\partial\log f}{\partial\sigma^2}\right) = 0, \qquad E\left(\frac{\partial\log f}{\partial\sigma^2}\right)^2 = \frac{1}{2\sigma^4},$$

so that in this case the optimum ellipse (32.6.5) becomes

$$\frac{(u-m)^2}{\sigma^2} + \frac{(v-\sigma^2)^2}{2\sigma^4} = \frac{4}{n}.$$

Consequently this fixed ellipse lies within the concentration ellipse of the joint
distribution of any pair of regular unbiased estimates of m and $\sigma^2$. For the particular
pair of estimates $\alpha^* = \bar{x}$ and $\beta^* = \frac{n}{n-1}s^2$ the relation (29.3.6) shows the
transformation which replaces the sample variables $x_1,\dots,x_n$ by the new variables $\bar{x}$, s
and $z_1,\dots,z_{n-2}$. The last factor in the expression of the fr. f. of the new variables
represents the conditional fr. f. of $z_1,\dots,z_{n-2}$, and this is independent of the unknown
parameters m and $\sigma$ (and, in fact, also of $\bar{x}$ and s, but this is of no importance for
our present purpose). Hence it follows that $\bar{x}$ and $\frac{n}{n-1}s^2$ are joint sufficient
estimates of m and $\sigma^2$. Further, we have

$$D^2(\bar{x}) = \frac{\sigma^2}{n}, \qquad D^2\left(\frac{n}{n-1}s^2\right) = \frac{2\sigma^4}{n-1}, \qquad \rho\left(\bar{x},\frac{n}{n-1}s^2\right) = 0.$$

Thus the concentration ellipse of $\bar{x}$ and $\frac{n}{n-1}s^2$ has the equation

$$\frac{(u-m)^2}{\sigma^2} + \frac{n-1}{n}\cdot\frac{(v-\sigma^2)^2}{2\sigma^4} = \frac{4}{n}.$$

The square of the ratio between the areas of the two ellipses gives the value $\frac{n-1}{n}$
for the joint efficiency of the estimates. When $n\to\infty$, the efficiency tends to unity,
and thus $\bar{x}$ and $\frac{n}{n-1}s^2$ are asymptotically efficient estimates of m and $\sigma^2$. The
same holds, of course, also for $\bar{x}$ and $s^2$, though $s^2$ is not unbiased.
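The value of the joint efficiency in this example can be checked by forming the ratio of the two-dimensional variances directly. The following sketch is an illustration under the assumption that the two-dimensional variance of a pair of estimates is the determinant of their moment matrix; the function names are hypothetical.

```python
def joint_efficiency(n, a, b, c, var1, var2, rho):
    """Two-dimensional variance 1 / (n^2 * Delta) of joint efficient estimates,
    divided by the two-dimensional variance var1 * var2 * (1 - rho^2) of the
    pair of estimates under consideration."""
    delta = a * b - c * c
    return 1.0 / (n * n * delta * var1 * var2 * (1.0 - rho * rho))


def normal_example(n, sigma2):
    """Joint efficiency of the mean and n/(n-1) * s^2 in normal samples."""
    a = 1.0 / sigma2                       # E (d log f / d m)^2
    b = 1.0 / (2.0 * sigma2 ** 2)          # E (d log f / d sigma^2)^2
    c = 0.0                                # the mixed expectation vanishes
    var_mean = sigma2 / n
    var_s2 = 2.0 * sigma2 ** 2 / (n - 1)   # variance of n/(n-1) * s^2
    return joint_efficiency(n, a, b, c, var_mean, var_s2, 0.0)


for n in (2, 10, 1000):
    print(n, normal_example(n, sigma2=3.0), (n - 1) / n)
```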

Ex. 2. Consider a two-dimensional normal fr. f. (21.12.1) with known values of
$\sigma_1$, $\sigma_2$ and $\rho$, while $\alpha = m_1$ and $\beta = m_2$ are the two unknown parameters. From a
sample of n pairs of values $(x_1,y_1),\dots,(x_n,y_n)$, we form the estimates $\alpha^* = \bar{x}$ and
$\beta^* = \bar{y}$. It is then easily shown that in this case the concentration ellipse of the
estimates $\bar{x}$ and $\bar{y}$ coincides with the fixed ellipse (32.6.5), each having the equation

$$\frac{n}{1-\rho^2}\left(\frac{(u-m_1)^2}{\sigma_1^2} - \frac{2\rho(u-m_1)(v-m_2)}{\sigma_1\sigma_2} + \frac{(v-m_2)^2}{\sigma_2^2}\right) = 4.$$

Thus $\bar{x}$ and $\bar{y}$ are joint efficient estimates (and a fortiori joint sufficient estimates)
of $m_1$ and $m_2$.


32.7. Several unknown parameters. — The results of the preceding
paragraphs may be generalized to distributions involving any number
of unknown parameters. If $\alpha_1^*,\dots,\alpha_k^*$ are any regular unbiased estimates
of the k unknown parameters $\alpha_1,\dots,\alpha_k$, it is shown in a similar
way as in the case k = 2 that the fixed k-dimensional ellipsoid

(32.7.1)  $$n\sum_{i,j=1}^{k} E\left(\frac{\partial\log f}{\partial\alpha_i}\frac{\partial\log f}{\partial\alpha_j}\right)(u_i-\alpha_i)(u_j-\alpha_j) = k+2$$

lies wholly within the concentration ellipsoid (cf 22.7) of the joint
distribution of $\alpha_1^*,\dots,\alpha_k^*$. In the limiting case when the two ellipsoids
coincide, we shall say that $\alpha_1^*,\dots,\alpha_k^*$ are joint efficient estimates of
$\alpha_1,\dots,\alpha_k$. Thus the distribution of a set of joint efficient estimates
has a greater concentration (cf 22.7) than the distribution of any set
of non-efficient estimates. The moment matrix of a set of joint efficient
estimates is the reciprocal of the matrix of the quadratic form
in the first member of (32.7.1), as shown in the preceding paragraph
for the case of two parameters. The concepts of sufficiency,
efficiency, etc. are introduced in the same way as in the case k = 2.

As an example, we consider a two-dimensional normal fr. f. with the five
unknown parameters $m_1$, $m_2$, $\mu_{20}$, $\mu_{11}$ and $\mu_{02}$. From a sample of n pairs of values
$(x_1,y_1),\dots,(x_n,y_n)$, we obtain the unbiased estimates $\bar{x}$, $\bar{y}$, $\frac{n}{n-1}m_{20}$, $\frac{n}{n-1}m_{11}$ and
$\frac{n}{n-1}m_{02}$ for the five parameters (cf 29.6). The moment matrix of the joint
distribution of the five estimates can be calculated e. g. by means of the expression (29.6.3)
of the joint c. f. of the estimates. Further, the coefficients in the equation (32.7.1)
of the optimum ellipsoid may be found by introducing the expression of the fr. f.
into (32.7.1) and performing the integrations. By simple, though somewhat tedious
calculations, it will be found that the joint efficiency of the five estimates is
$\left(\frac{n-1}{n}\right)^3$. When $n\to\infty$, this tends to unity, so that the estimates are asymptotically
efficient.


32.8. Generalization. — Throughout the present chapter, we have
been concerned with the problem of estimating certain parameters
from a set of values, obtained by independent drawings from a fixed
distribution. However, our methods are applicable under more general
conditions. Consider e. g. the following problem:

The variables $x_1,\dots,x_n$ have a joint distribution in $R_n$, with the
fr. f. $f(x_1,\dots,x_n;\alpha)$ of known mathematical form, containing the unknown
parameter $\alpha$. An observed point $x = (x_1,\dots,x_n)$ is known, and
it is required to find the »best possible» estimate $\alpha^* = \alpha^*(x_1,\dots,x_n)$
of $\alpha$ by means of the observed coordinates $x_i$.

In the particular case when the joint fr. f. is of the form
$f(x_1;\alpha)\cdots f(x_n;\alpha)$, this reduces to the problem treated in 32.3, where
the $x_i$ are independent variables having the same distribution. The
general set-up covers e. g. also the cases when the $x_i$ are correlated, or
when they consist of several independent samples from different distributions.
Even in the general case, we talk of the point $x = (x_1,\dots,x_n)$
as a sample point, which is represented in the sample space $R_n$.

We now consider the same transformation of variables in the
sample space as in (32.3.1) and (32.3.2). In the present case, however,
we have to introduce the general form of the joint fr. f. into the
formulae expressing the transformation, so that e. g. (32.3.2) becomes

$$f(x_1,\dots,x_n;\alpha)\,dx_1\cdots dx_n = g(\alpha^*;\alpha)\,h(\xi_1,\dots,\xi_{n-1}\mid\alpha^*;\alpha)\,d\alpha^*\,d\xi_1\cdots d\xi_{n-1}.$$

The whole argument of 32.3-32.5 (continuous case) now applies almost
without modification, and in this way the concepts of unbiased, efficient
and sufficient estimates etc. are extended to the present general
case. Thus e. g. the generalized form of the inequality (32.3.3 a) for
the variance of an unbiased estimate is

$$D^2(\alpha^*) \;\geq\; \left[\int\!\cdots\!\int\left(\frac{\partial\log f}{\partial\alpha}\right)^2 f(x_1,\dots,x_n;\alpha)\,dx_1\cdots dx_n\right]^{-1} = \left[E\left(\frac{\partial\log f}{\partial\alpha}\right)^2\right]^{-1},$$

and when the sign of equality holds here, we call $\alpha^*$ an efficient
estimate. When the conditional fr. f. h is independent of $\alpha$, we call
$\alpha^*$ a sufficient estimate, etc.

The same generalization may evidently be applied to cases of
discrete distributions, and to distributions containing several unknown
parameters.


CHAPTER 33. 

Methods of Estimation. 

33.1. The method of moments. — We now proceed to discuss 
some general methods of forming estimates of the parameters of a 
distribution by means of a set of sample values. 

The oldest general method proposed for this purpose is the method 
of moments introduced by K. Pearson (Ref. 180, 182, 184 and other 
works), and extensively used by him and his school. This method 
consists in equating a convenient number of the sample moments to 
the corresponding moments of the distribution, which are functions 
of the unknown parameters. By considering as many moments as 
there are parameters to be estimated, and solving the resulting equa¬ 
tions with respect to the parameters, estimates of the latter are ob¬ 
tained. This method often leads to comparatively simple calculations 
in practice. 

The estimates obtained in this way from a set of n sample values
are functions of the sample moments, and certain properties of their
sampling distributions may be inferred from Chs 27-28. Thus we
have seen (cf in particular 27.7 and 28.4) that, under fairly general
conditions, the distribution of an estimate of this kind will be asymptotically
normal for large n, and that the mean of the estimate will
differ from the true value of the parameter by a quantity of order
$n^{-1}$, while the s. d. will be asymptotically of the form $c/\sqrt{n}$. By a
simple correction, we may often remove the bias of such an estimate,
and thus obtain an unbiased estimate (cf 27.6).

Under general conditions, the method of moments will thus yield
estimates such that the asymptotic efficiency defined in 32.5 (or the
corresponding quantity in the case of several parameters) exists. As
pointed out by R. A. Fisher (Ref. 89), this quantity is, however, often
considerably less than 1, which implies that the estimates given by the
method of moments are not the »best» possible from the efficiency
point of view, i. e. they do not have the smallest possible variance
in large samples. Nevertheless, on account of its practical expediency
the method will often render good service. Sometimes the estimates
given by the method of moments may be used as first approximations,
from which further estimates of higher efficiency may be determined
by means of other methods.

In the particular case of the normal distribution, the method of moments gives
the estimates $\bar{x}$ and $s^2$ for the unknown parameters m and $\sigma^2$. Correcting for bias,
we obtain the unbiased and asymptotically efficient (cf 32.6, Ex. 1) estimates $\bar{x}$ and
$\frac{n}{n-1}s^2$. It was shown by Fisher (Ref. 89) that, in this respect, the normal
distribution is exceptional among the distributions belonging to the Pearson system (cf 19.4),
the asymptotic efficiency in other cases being as a rule less than 1. Some examples
will be given in 33.3.
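For the normal case, the method of moments reduces to computing the first two sample moments. The sketch below shows the computation and the bias correction; the simulated sample, its parameters and the seed are assumptions made for illustration.

```python
import random
import statistics


def moment_estimates_normal(sample):
    """Method-of-moments estimates of m and sigma^2 for a normal sample,
    together with the bias-corrected variance estimate n/(n-1) * s^2."""
    n = len(sample)
    mean = sum(sample) / n                          # first sample moment
    s2 = sum((x - mean) ** 2 for x in sample) / n   # second central sample moment
    s2_unbiased = n / (n - 1) * s2
    return mean, s2, s2_unbiased


rng = random.Random(0)
sample = [rng.gauss(10.0, 2.0) for _ in range(500)]   # m = 10, sigma = 2
mean, s2, s2_unbiased = moment_estimates_normal(sample)
print(mean, s2, s2_unbiased)
```

The corrected estimate coincides with `statistics.variance`, while the uncorrected one coincides with `statistics.pvariance`.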

33.2. The method of maximum likelihood. — From a theoretical
point of view, the most important general method of estimation so
far known is the method of maximum likelihood. In particular cases,
this method was already used by Gauss (Ref. 16); as a general method
of estimation it was first introduced by R. A. Fisher in a short paper
(Ref. 87) of 1912, and has afterwards been further developed in a
series of works (Ref. 89, 96, 103, 104 etc.) by the same author. Important
contributions have also been made by others, and we refer in
this connection particularly to Dugué (Ref. 76).

Using the notations of 32.3, we define the likelihood function L of
a sample of n values from a population of the continuous type by the
relation

(33.2.1 a)  $$L(x_1,\dots,x_n;\alpha) = f(x_1;\alpha)\cdots f(x_n;\alpha),$$

while in the discrete case we write

(33.2.1 b)  $$L(x_1,\dots,x_n;\alpha) = p_{i_1}(\alpha)\cdots p_{i_n}(\alpha).$$

When the sample values are given, the likelihood function L becomes
a function of the single variable $\alpha$. The method of maximum likelihood
now consists in choosing, as an estimate of the unknown population
value of $\alpha$, the particular value that renders L as great as possible.
Since log L attains its maximum for the same value of $\alpha$ as L,
we thus have to solve the likelihood equation

(33.2.2)  $$\frac{\partial\log L}{\partial\alpha} = 0$$

with respect to $\alpha$. Let us agree to disregard any root of the form
$\alpha$ = const., thus counting as a solution only a root which effectively
depends on the sample values $x_1,\dots,x_n$. Any solution of the likelihood
equation will then be called a maximum likelihood estimate of $\alpha$.
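When the likelihood equation has no explicit solution, the maximum of log L may be located numerically. The following is a minimal sketch for the Cauchy distribution of 32.5, Ex. 2; the golden-section search, the search interval of half-width 1 about the sample median, and the simulated data are all assumptions made for illustration, not a method prescribed by the text.

```python
import math
import random
import statistics


def log_likelihood(mu, sample):
    """log L for the Cauchy fr. f. (1/pi) / (1 + (x - mu)^2)."""
    return sum(-math.log(math.pi) - math.log1p((x - mu) ** 2) for x in sample)


def maximize(func, lo, hi, tol=1e-8):
    """Golden-section search for a local maximum of func on [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - ratio * (hi - lo)
    d = lo + ratio * (hi - lo)
    fc, fd = func(c), func(d)
    while hi - lo > tol:
        if fc > fd:                 # maximum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - ratio * (hi - lo)
            fc = func(c)
        else:                       # maximum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + ratio * (hi - lo)
            fd = func(d)
    return (lo + hi) / 2.0


rng = random.Random(3)
true_mu = 1.5
sample = [true_mu + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(2001)]
start = statistics.median(sample)   # a consistent first approximation
mu_hat = maximize(lambda mu: log_likelihood(mu, sample), start - 1.0, start + 1.0)
print(start, mu_hat)
```

The search only finds a local maximum inside the chosen interval, so it relies on the first approximation being close to the true value.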

In the present paragraph, we shall consider some properties of the
maximum likelihood method for samples of a fixed size n, while in the
next paragraph the asymptotic behaviour of maximum likelihood estimates
for large values of n will be investigated. The importance
of the method is clearly shown by the two following propositions:

If an efficient estimate $\alpha^*$ of $\alpha$ exists, the likelihood equation will
have a unique solution equal to $\alpha^*$.

If a sufficient estimate $\alpha^*$ of $\alpha$ exists, any solution of the likelihood
equation will be a function of $\alpha^*$.

It will be sufficient to prove these propositions for the continuous
case, the modifications required for the discrete case being obvious.
When an efficient estimate $\alpha^*$ exists, the conditions A) and B) stated
in connection with (32.3.3 a) are satisfied, and thus by (32.3.5) we have

$$\frac{\partial\log L}{\partial\alpha} = \sum_{i=1}^{n}\frac{\partial\log f(x_i;\alpha)}{\partial\alpha} = k(\alpha^*-\alpha),$$

where k is independent of the sample values, but may depend on $\alpha$.
According to our convention with respect to the solutions of the
likelihood equation (33.2.2), this equation will thus have the unique
solution $\alpha = \alpha^*$.
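As an illustration (a standard case, not worked out at this point of the text), take a normal distribution with the unknown mean m and a known s. d. $\sigma$, for which the sample mean $\bar{x}$ is an efficient estimate of m. The likelihood equation then has exactly the form just described, with $k = n/\sigma^2$:

```latex
\frac{\partial \log L}{\partial m}
  = \sum_{i=1}^{n} \frac{x_i - m}{\sigma^2}
  = \frac{n}{\sigma^2}\,(\bar{x} - m) = 0
  \quad\Longrightarrow\quad m = \bar{x}.
```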


Further, when a sufficient estimate $\alpha^*$ exists, condition A) of 32.3
is satisfied, and by (32.3.5) the likelihood equation then reduces to

$$\frac{\partial\log L}{\partial\alpha} = \frac{\partial\log g(\alpha^*;\alpha)}{\partial\alpha} = 0.$$

The function g depends only on the two arguments $\alpha^*$ and $\alpha$, and
thus any solution will be a function of $\alpha^*$.

The above definitions and propositions may be directly generalized
to the case of several unknown parameters, and to samples from
multidimensional distributions. Thus e. g. for a continuous distribution
with two unknown parameters $\alpha$ and $\beta$, the likelihood function is
$L(x_1,\dots,x_n;\alpha,\beta) = \prod_{i=1}^{n} f(x_i;\alpha,\beta)$, and the maximum likelihood estimates
of $\alpha$ and $\beta$ will be given by the solutions of the simultaneous
equations $\frac{\partial\log L}{\partial\alpha} = 0$, $\frac{\partial\log L}{\partial\beta} = 0$, with respect to $\alpha$ and $\beta$. When
a pair of joint efficient estimates $\alpha^*$ and $\beta^*$ exists, the likelihood
equations will have the unique solution $\alpha = \alpha^*$, $\beta = \beta^*$.

The maximum likelihood method may even be applied in the general
situation considered in 32.8. In this case, the method consists in
choosing as our estimate the value of $\alpha$ that renders the joint fr. f.
$f(x_1,\dots,x_n;\alpha)$ as large as possible for given values of the $x_i$.

Some examples will be given in the next paragraph.


33.3. Asymptotic properties of maximum likelihood estimates. —
We now proceed to investigate the asymptotic behaviour of maximum
likelihood estimates for large values of n. We first consider the case
of a single unknown parameter $\alpha$.

It will be shown that, under certain general conditions, the likelihood
equation (33.2.2) has a solution which converges in probability to the true
value of $\alpha$, as $n\to\infty$. This solution is an asymptotically normal and
asymptotically efficient estimate of $\alpha$.

As before, it will be sufficient to give the proof for the case of
a continuous distribution, specified by the fr. f. $f(x;\alpha)$. We shall use
a method of proof indicated by Dugué (Ref. 76). Suppose that the
following conditions are satisfied:

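1) For almost all x, the derivatives $\frac{\partial\log f}{\partial\alpha}$, $\frac{\partial^2\log f}{\partial\alpha^2}$ and $\frac{\partial^3\log f}{\partial\alpha^3}$
exist for every $\alpha$ belonging to a non-degenerate interval A.

2) For every $\alpha$ in A, we have $\left|\frac{\partial f}{\partial\alpha}\right| < F_1(x)$, $\left|\frac{\partial^2 f}{\partial\alpha^2}\right| < F_2(x)$ and
$\left|\frac{\partial^3\log f}{\partial\alpha^3}\right| < H(x)$, the functions $F_1$ and $F_2$ being integrable over
$(-\infty,\infty)$, while $\int_{-\infty}^{\infty} H(x)\,f(x;\alpha)\,dx < M$, where M is independent of $\alpha$.

3) For every $\alpha$ in A, the integral $\int_{-\infty}^{\infty}\left(\frac{\partial\log f}{\partial\alpha}\right)^2 f\,dx$ is finite and
positive.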

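We now denote by $\alpha_0$ the unknown true value of the parameter
$\alpha$ in the distribution from which we are sampling, and we suppose that
$\alpha_0$ is an inner point of A. We shall then first show that the likelihood
equation (33.2.2) has a solution which converges in probability to $\alpha_0$.
For every $\alpha$ in A we have, indicating by the subscript 0 that $\alpha$
should be put equal to $\alpha_0$,

$$\frac{\partial\log f}{\partial\alpha} = \left(\frac{\partial\log f}{\partial\alpha}\right)_0 + (\alpha-\alpha_0)\left(\frac{\partial^2\log f}{\partial\alpha^2}\right)_0 + \tfrac12\,\theta\,(\alpha-\alpha_0)^2\,H(x),$$

where $|\theta| < 1$. Thus the likelihood equation (33.2.2) may, after multiplication
by 1/n, be written in the form

(33.3.1)  $$\frac{1}{n}\frac{\partial\log L}{\partial\alpha} = B_0 + B_1(\alpha-\alpha_0) + \tfrac12\,\theta B_2(\alpha-\alpha_0)^2 = 0,$$

where, writing $f_i$ in the place of $f(x_i;\alpha)$,

(33.3.2)  $$B_0 = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial\log f_i}{\partial\alpha}\right)_0, \qquad B_1 = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial^2\log f_i}{\partial\alpha^2}\right)_0, \qquad B_2 = \frac{1}{n}\sum_{i=1}^{n} H(x_i).$$

The $B_\nu$ are functions of the random variables $x_1,\dots,x_n$, and we now
have to show that, with a probability tending to 1 as $n\to\infty$, the
equation (33.3.1) has a root $\alpha$ between the limits $\alpha_0\pm\delta$, however
small the positive quantity $\delta$ is chosen.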

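Let us consider the behaviour of the $B_\nu$ for large values of n.
From the conditions 1) and 2) it follows (cf 32.3.4) that

$$\int_{-\infty}^{\infty}\frac{\partial f}{\partial\alpha}\,dx = \int_{-\infty}^{\infty}\frac{\partial^2 f}{\partial\alpha^2}\,dx = 0$$

for every $\alpha$ in A, and hence we obtain

(33.3.3)  $$E\left(\frac{\partial\log f}{\partial\alpha}\right)_0 = \int_{-\infty}^{\infty}\left(\frac{1}{f}\frac{\partial f}{\partial\alpha}\right)_0 f(x;\alpha_0)\,dx = 0,$$

$$E\left(\frac{\partial^2\log f}{\partial\alpha^2}\right)_0 = \int_{-\infty}^{\infty}\left[\frac{1}{f}\frac{\partial^2 f}{\partial\alpha^2} - \left(\frac{1}{f}\frac{\partial f}{\partial\alpha}\right)^2\right]_0 f(x;\alpha_0)\,dx = -E\left(\frac{\partial\log f}{\partial\alpha}\right)_0^2 = -k^2,$$

where by condition 3) we have k > 0. Thus by (33.3.2) $B_0$ is the
arithmetic mean of n independent random variables, all having the
same distribution with the mean value zero. By Khintchine's theorem
20.5, it follows that $B_0$ converges in probability to zero. In the same
way we find that $B_1$ converges in probability to $-k^2$, while $B_2$ converges
in probability to the non-negative value $E\,H(x) < M$.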
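The convergence of $B_0$ and $B_1$ can be observed in a simulated sample. The sketch below uses the Cauchy distribution of 32.5, Ex. 2, for which $k^2 = 1/2$; the choice of distribution, the sample size and the seed are assumptions made for illustration only.

```python
import math
import random


def score_terms(sample, mu0):
    """B_0 and B_1 of the expansion of (1/n) d log L / d mu about mu0,
    for the Cauchy fr. f. (1/pi) / (1 + (x - mu)^2)."""
    n = len(sample)
    # First derivative of log f with respect to mu: 2 t / (1 + t^2), t = x - mu0.
    b0 = sum(2.0 * (x - mu0) / (1.0 + (x - mu0) ** 2) for x in sample) / n
    # Second derivative of log f with respect to mu: 2 (t^2 - 1) / (1 + t^2)^2.
    b1 = sum(2.0 * ((x - mu0) ** 2 - 1.0) / (1.0 + (x - mu0) ** 2) ** 2
             for x in sample) / n
    return b0, b1


rng = random.Random(7)
mu0 = 0.0
sample = [mu0 + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(20_000)]
b0, b1 = score_terms(sample, mu0)
print(f"B_0 = {b0:.4f} (limit 0),  B_1 = {b1:.4f} (limit -1/2)")
```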

Let now $\delta$ and $\varepsilon$ be given arbitrarily small positive numbers, and
let P(S) denote the joint pr. f. of the random variables $x_1,\dots,x_n$. For
all sufficiently large n, say for all $n > n_0 = n_0(\delta,\varepsilon)$, we then have

$$P_1 = P(|B_0| \geq \delta^2) < \tfrac13\varepsilon,$$
$$P_2 = P(B_1 \geq -\tfrac12 k^2) < \tfrac13\varepsilon,$$
$$P_3 = P(|B_2| \geq 2M) < \tfrac13\varepsilon.$$

Let further S denote the set of all points $x = (x_1,\dots,x_n)$ such that
all three inequalities

$$|B_0| < \delta^2, \qquad B_1 < -\tfrac12 k^2, \qquad |B_2| < 2M$$

are satisfied. The complementary set $S^*$ consists of all points x such
that at least one of these three inequalities is not satisfied, and thus
we have by (6.2.2)

$$P(S^*) \leq P_1 + P_2 + P_3 < \varepsilon, \quad\text{and hence}\quad P(S) > 1-\varepsilon.$$

Thus the probability that the point x belongs to the set S, which is
identical with the P-measure of S, is $> 1-\varepsilon$, as soon as $n > n_0(\delta,\varepsilon)$.


For $\alpha = \alpha_0\pm\delta$, the second member of (33.3.1) assumes the values
$B_0 \pm B_1\delta + \tfrac12\theta B_2\delta^2$. In every point x belonging to S, the sum of
the first and third terms of this expression is smaller in absolute
value than $(M+1)\delta^2$, while we have $B_1\delta < -\tfrac12 k^2\delta$. If $\delta < \tfrac12 k^2/(M+1)$,
the sign of the whole expression will thus for $\alpha = \alpha_0\pm\delta$ be determined
by the second term, so that we have $\frac{\partial\log L}{\partial\alpha} > 0$ for $\alpha = \alpha_0-\delta$, and
$\frac{\partial\log L}{\partial\alpha} < 0$ for $\alpha = \alpha_0+\delta$. Further, by condition 1) the function
$\frac{\partial\log L}{\partial\alpha}$ is for almost all $x = (x_1,\dots,x_n)$ a continuous function of $\alpha$
in A. Thus for arbitrarily small $\delta$ and $\varepsilon$ the likelihood equation will,
with a probability exceeding $1-\varepsilon$, have a root between the limits
$\alpha_0\pm\delta$ as soon as $n > n_0(\delta,\varepsilon)$, and consequently the first part of the
proof is completed.

Next, let $\alpha^* = \alpha^*(x_1,\dots,x_n)$ be the solution of the likelihood
equation, the existence of which has just been established. From
(33.3.1) and (33.3.2) we obtain

(33.3.4)  $$k\sqrt{n}\,(\alpha^*-\alpha_0) = \frac{\dfrac{1}{k\sqrt{n}}\displaystyle\sum_{i=1}^{n}\left(\frac{\partial\log f_i}{\partial\alpha}\right)_0}{-\dfrac{B_1}{k^2} - \tfrac12\,\theta\,\dfrac{B_2}{k^2}\,(\alpha^*-\alpha_0)}.$$

It follows from the above that the denominator of the fraction in
the second member converges in probability to 1. Further, by (33.3.3)
$\left(\frac{\partial\log f}{\partial\alpha}\right)_0$ is a variable with the mean zero and the s. d. k. By the
Lindeberg-Lévy theorem (cf 17.4), the sum $\sum_{i=1}^{n}\left(\frac{\partial\log f_i}{\partial\alpha}\right)_0$ is then
asymptotically normal $(0, k\sqrt{n})$, and consequently the numerator in the
second member of (33.3.4) is asymptotically normal (0, 1).

Finally, it now follows from the convergence theorem of 20.6 that
$k\sqrt{n}\,(\alpha^*-\alpha_0)$ is asymptotically normal (0, 1), so that $\alpha^*$ is asymptotically
normal $(\alpha_0, c/\sqrt{n})$, where $1/c^2 = k^2 = E\left(\frac{\partial\log f}{\partial\alpha}\right)_0^2$. By (32.5.1), the
asymptotic efficiency of $\alpha^*$ is then

$$e_0(\alpha^*) = \frac{1}{c^2\,E\left(\frac{\partial\log f}{\partial\alpha}\right)_0^2} = 1,$$

and thus our theorem is proved. The corresponding theorem for a
discrete distribution is proved in the same way.
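The asymptotic normality asserted by the theorem can be illustrated by simulation. The sketch below assumes the fr. f. $\alpha e^{-\alpha x}$ (x > 0), for which the likelihood equation gives $\alpha^* = 1/\bar{x}$ and $k^2 = 1/\alpha_0^2$; the sample size, the number of repetitions and the seed are arbitrary choices.

```python
import math
import random
import statistics


def standardized_mle(n, alpha0, rng):
    """k * sqrt(n) * (alpha* - alpha0) for f(x; alpha) = alpha * exp(-alpha x),
    where alpha* = 1 / (sample mean) and k^2 = 1 / alpha0^2."""
    total = sum(rng.expovariate(alpha0) for _ in range(n))
    alpha_star = n / total
    return math.sqrt(n) * (alpha_star - alpha0) / alpha0


rng = random.Random(11)
values = [standardized_mle(400, 2.0, rng) for _ in range(3000)]
mean_z = statistics.fmean(values)
var_z = statistics.pvariance(values)
share = sum(abs(z) < 1.96 for z in values) / len(values)
print(f"mean {mean_z:.3f}, variance {var_z:.3f}, share within 1.96: {share:.3f}")
```

For a normal (0, 1) variable the three printed quantities would be 0, 1 and about 0.95; the simulated values are close to these, with a small positive mean reflecting the bias of $1/\bar{x}$ for finite n.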

In the case of several unknown parameters, we have to introduce
conditions which form a straightforward generalization of the conditions
1)-3). It is then proved in the same way as above, using the
multi-dimensional form of the Lindeberg-Lévy theorem (cf 21.11 and
24.7), that the likelihood equations have a system of solutions which
are asymptotically normal and joint asymptotically efficient estimates
of the parameters.

Ex. 1. For a sample of n values from a normal distribution with the unknown
parameters m and $\sigma^2$, the logarithm of the likelihood function is

$$\log L = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-m)^2 - \tfrac12 n\log\sigma^2 - \tfrac12 n\log 2\pi,$$

and the maximum likelihood method gives the equations

$$\frac{\partial\log L}{\partial m} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-m) = 0, \qquad \frac{\partial\log L}{\partial\sigma^2} = \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i-m)^2 - \frac{n}{2\sigma^2} = 0.$$

Hence we obtain the maximum likelihood estimates

$$m^* = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}, \qquad (\sigma^2)^* = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2 = s^2,$$

which coincide with the estimates given by the method of moments. We have already
seen (cf 28.4 and 32.6, Ex. 1) that these estimates are asymptotically normal and
asymptotically efficient.

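Ex. 2. Consider the type III distribution (cf 19.4)

$$f(x;\lambda) = \frac{1}{\Gamma(\lambda)}\,x^{\lambda-1}e^{-x}, \qquad (x > 0,\ \lambda > 0)$$

with the unknown parameter $\lambda$. For any finite interval $a < \lambda < b$ with a > 0 we
may apply (32.3.3 a), and thus find that the lower limit of the variance of a regular
unbiased estimate of $\lambda$ from a sample of n values is (cf 12.3)

$$\frac{1}{n\,E\left(\frac{\partial\log f}{\partial\lambda}\right)^2} = \frac{1}{n\,\dfrac{d^2\log\Gamma(\lambda)}{d\lambda^2}}.$$

In order to estimate $\lambda$ by the method of moments, we equate the sample mean $\bar{x}$ to
the first moment $\lambda$ of the distribution, and thus obtain the estimate $\lambda^* = \bar{x}$. We
then easily find $E(\lambda^*) = \lambda$, $D^2(\lambda^*) = \lambda/n$. Hence it follows by (32.3.7) and (12.5.4)
that the efficiency of $\lambda^*$ is independent of n and has the value

$$e(\lambda^*) = \frac{1}{\lambda\,\dfrac{d^2\log\Gamma(\lambda)}{d\lambda^2}} = \frac{1}{\lambda\displaystyle\sum_{\nu=0}^{\infty}\frac{1}{(\lambda+\nu)^2}}.$$

This is always less than 1, and tends to zero as $\lambda\to 0$. On the other hand, the
method of maximum likelihood leads to the equation

$$\frac{1}{n}\frac{\partial\log L}{\partial\lambda} = \frac{1}{n}\sum_{i=1}^{n}\log x_i - \frac{d\log\Gamma(\lambda)}{d\lambda} = 0,$$

and the maximum likelihood estimate is the unique positive root $\lambda = \lambda^{**}$ of this
equation. According to the general theorem proved above, $\lambda^{**}$ is asymptotically
normal $\left(\lambda, \left[n\,\frac{d^2\log\Gamma(\lambda)}{d\lambda^2}\right]^{-1/2}\right)$, and the asymptotic efficiency of $\lambda^{**}$ is equal to 1. This
can also without difficulty be seen directly, since the variable log x has the mean
$\frac{d\log\Gamma(\lambda)}{d\lambda}$ and the variance $\frac{d^2\log\Gamma(\lambda)}{d\lambda^2}$, and thus (cf 17.4) by the Lindeberg-Lévy
theorem $\frac{1}{n}\sum\log x_i$ is asymptotically normal $\left(\frac{d\log\Gamma(\lambda)}{d\lambda}, \left[\frac{1}{n}\frac{d^2\log\Gamma(\lambda)}{d\lambda^2}\right]^{1/2}\right)$.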
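The two estimates of this example can be compared numerically. The sketch below is an illustration only: the derivative of $\log\Gamma$ is approximated by a central difference of `math.lgamma`, the second derivative by the series $\sum 1/(\lambda+\nu)^2$ with an integral estimate of the tail, and the likelihood equation is solved by bisection. These numerical devices, the simulated sample and the seed are assumptions, not part of the text.

```python
import math
import random


def digamma(lam, h=1e-5):
    """d log Gamma / d lambda, approximated by a central difference."""
    return (math.lgamma(lam + h) - math.lgamma(lam - h)) / (2.0 * h)


def trigamma(lam, terms=10_000):
    """d^2 log Gamma / d lambda^2 as the series sum of 1/(lam + v)^2, v >= 0,
    truncated after `terms` terms plus an integral estimate of the tail."""
    head = sum(1.0 / (lam + v) ** 2 for v in range(terms))
    return head + 1.0 / (lam + terms - 0.5)


def moment_efficiency(lam):
    """Efficiency of the moment estimate (the sample mean) of lambda."""
    return 1.0 / (lam * trigamma(lam))


def ml_shape(sample, lo=1e-3, hi=1e3):
    """Root of (1/n) sum log x_i = d log Gamma / d lambda, found by bisection."""
    target = sum(math.log(x) for x in sample) / len(sample)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if digamma(mid) < target:   # d log Gamma / d lambda increases with lambda
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


rng = random.Random(5)
lam0 = 2.5
sample = [rng.gammavariate(lam0, 1.0) for _ in range(5000)]
lam_mom = sum(sample) / len(sample)
lam_ml = ml_shape(sample)
print(lam_mom, lam_ml, moment_efficiency(lam0))
```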

Ex. 3. In the type III distribution 


f(x \ a) - 


a 1 

! /w 


x l-\ t -ax t (x > 0, a >0) 


we now consider 又 as a given positive constant, while a is tbe unknown parameter. 
We then have 


=E (卜如各. 


In this case, the method of moments and the method of maximum likelihood give
the same estimate λ/x̄ for a. Correcting for bias, we obtain the unbiased estimate
a* = (nλ - 1)/(n x̄), which has the fr. f.

$$g(a^*; a) = \frac{a^{n\lambda}\,(n\lambda - 1)^{n\lambda}}{\Gamma(n\lambda)} \left(\frac{1}{a^*}\right)^{n\lambda + 1} e^{-a (n\lambda - 1)/a^*},$$


as is found without difficulty, e. g. by means of the c. f. (12.3.4). Supposing nλ > 2,
we then obtain E(a*) = a, D²(a*) = a²/(nλ - 2), and

$$E\left(\frac{\partial \log g}{\partial a}\right)^2 = E\left(\frac{n\lambda}{a} - \frac{n\lambda - 1}{a^*}\right)^2 = \frac{n\lambda}{a^2}.$$

Thus we have in this case n E(∂ log f/∂a)² = E(∂ log g/∂a)², so that the sign
of equality holds in (32.3.6), which implies that condition A) of theorem (32.3.3)
is satisfied. Hence it follows that a* is a sufficient estimate of a, and this may
also be directly verified by means of (32.4.1). On the other hand, condition B) is
not satisfied, since ∂ log g/∂a is not of the form k(a* - a). Accordingly the
efficiency of a* is

$$e(a^*) = \frac{a^2/(n\lambda)}{a^2/(n\lambda - 2)} = \frac{n\lambda - 2}{n\lambda} < 1,$$

so that a* is not efficient for any finite n (cf 32.3). Allowing n to tend to infinity
we see, though, that a* is asymptotically efficient.
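The statements E(a*) = a and D²(a*) = a²/(nλ - 2) of Ex. 3 can be checked by a small sampling experiment. This is a sketch under assumed values (a = 3, λ = 2, n = 5, chosen only for illustration); the function names are ours.

```python
import random
from statistics import mean, pvariance

def unbiased_estimate(sample, lam):
    """a* = (n*lam - 1) / (n * sample mean) = (n*lam - 1) / sum(sample)."""
    return (len(sample) * lam - 1.0) / sum(sample)

def simulate(a, lam, n, trials, seed=0):
    """Mean and variance of a* over repeated samples of n values."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        # gammavariate(shape, scale): shape lam and scale 1/a give the fr. f. f(x; a)
        sample = [rng.gammavariate(lam, 1.0 / a) for _ in range(n)]
        estimates.append(unbiased_estimate(sample, lam))
    return mean(estimates), pvariance(estimates)

if __name__ == "__main__":
    a, lam, n = 3.0, 2.0, 5          # hypothetical values
    m, v = simulate(a, lam, n, trials=20000)
    print("mean of a*:", m, "theory:", a)
    print("variance of a*:", v, "theory:", a * a / (n * lam - 2))
    print("efficiency:", (n * lam - 2) / (n * lam))
```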


33.4. The χ² minimum method. — The χ² minimum method discussed
in 30.3 is only available in the case of a grouped continuous
distribution, or a discrete distribution. For large n the estimates
obtained by this method are asymptotically equivalent to those given
by the simpler modified χ² minimum method expressed by the equations
(30.3.3) or (30.3.3 a), and we have already remarked in 30.3 that the 
latter method is, for the cases concerned, identical with the maximum 
likelihood method. 

The main theorem on the limiting distribution of χ² when certain
parameters are estimated from the sample has been proved in 30.3
under the hypothesis that the method of estimation is the modified
χ² minimum method. However, we have stated in 30.3 that there is
a whole class of methods of estimation leading to the same limiting
distribution of χ². We shall now prove this statement.

Asymptotic expressions of the estimates obtained by the modified
χ² minimum method have been given in an explicit form in (30.3.17),
for the general case of s unknown parameters a_1, ..., a_s. Let us suppose
that the conditions 1) to 3) of the preceding paragraph (or the
analogous conditions for a discrete distribution) are satisfied. It
then follows from the preceding paragraph that the estimates (30.3.17)
are asymptotically normal (this has, in fact, already been shown in
30.3) and asymptotically efficient.

Now in all sets of asymptotically normal and asymptotically efficient
estimates of the parameters, the terms of order n^(-1/2) must agree, and
thus will be the same as in (30.3.17). An inspection of the deduction
of the limiting distribution of χ² given in 30.3 shows, however, that
this limiting distribution is entirely determined by the terms of order
n^(-1/2) in (30.3.17). In fact, by (30.3.1) and (30.3.4) we have

$$\chi^2 = \sum_{i=1}^{r} y_i^2,$$

and (30.3.18) shows that the limiting distribution of y = (y_1, ..., y_r)
is determined by the terms in question.

It thus follows that the theorem of 30.3 on the limiting distribution
of χ² holds for any set of asymptotically normal and asymptotically
efficient estimates of the parameters.



CHAPTER 34. 

Confidence Regions. 

34.1. Introductory remarks. — Suppose that we are using a set 
of sample values to form estimates of a certain number of unknown
parameters in a distribution of known mathematical form. Suppose
further that the sampling distributions of our estimates are known,
so that the respective means, variances etc. can be calculated.

Are we, in such a situation, entitled to make some kind of probability
statements with respect to the unknown true values of the
parameters? Will it, e. g., be possible to assign two limits to a certain
parameter, and to assert that, with some specified probability, the
true value of the parameter will be situated between these limits?

In the older literature of the subject, probability statements of
this type were freely deduced by means of the famous theorem of Bayes,
one of the typical problems treated in this way being the classical
problem of inverse probability (cf 34.2, Ex. 2). However, these applications
of Bayes' theorem have often been severely criticized, and there
has appeared a growing tendency to avoid this kind of argument, and
to reconsider the question from entirely new points of view. The attempts
so far made in this direction have grouped themselves along
two main lines of development, connected with the theory of fiducial
probabilities due to R. A. Fisher (cf e. g. Ref. 14, 100, 102, 105-109)
and the theory of confidence intervals due to J. Neyman (cf e. g. Ref.
30, 161, 163, 165-167). We shall here in the main have to restrict
ourselves to a brief account of the latter theory.

In the next paragraph, we shall consider the case of a single
unknown parameter, comparing the older treatment by means of Bayes'
theorem with the modern theory. In 34.3, we then proceed to more
general cases, and finally we discuss in 34.4 some examples.

34.2. A single unknown parameter. — Consider a sample of n
values x_1, ..., x_n from a distribution involving a single unknown parameter
a. We shall first suppose that the distribution is of the continuous
type, and has the fr. f. f(x; a). For simplicity we suppose
that f(x; a) is defined for all values of a. Let a* = a*(x_1, ..., x_n) be
an estimate of a, with the fr. f. g(a*; a).

Having calculated the value of a* from an actual sample, we now
ask if it is possible to make some reasonable probability statement
with respect to the unknown value of a in the distribution from which
the sample is drawn. The question will be considered from two fundamentally
different points of view.

1. The classical method. In some cases, it may be legitimate to
assume that the actual value of the parameter a in the sampled
population has been determined by a random experiment. Cases of this
character occur e. g. in the statistics of mass production, when a denotes
some unknown characteristic of a large batch of manufactured
articles, which it is required to estimate from a small sample. The
particular batch under consideration will then have to be regarded as
an individual drawn from a population of similar batches, where the
values of a are submitted to random fluctuations due to variations in
the production process and the quality of raw materials. The drawing
of one individual from this population of batches is the random experiment
which determines the actual value of a. Similar cases
occur e. g. in certain genetical problems.

In such cases, a is itself a random variable, having a certain a
priori distribution. Let us assume that this distribution is defined by
a known fr. f. ϖ(a). In the joint distribution of a and a*, the function
ϖ(a) is then the marginal fr. f. of a, while g(a*; a) is the conditional
fr. f. of a* for a given value of a. Conversely, the conditional
fr. f. of a, for a given value of a*, is by (21.4.10)

$$h(a \mid a^*) = \frac{\varpi(a)\, g(a^*; a)}{\int_{-\infty}^{\infty} \varpi(a)\, g(a^*; a)\, da}.$$

This relation expresses Bayes' theorem as applied to the present case.
The quantity

$$P(k_1 < a < k_2 \mid a^*) = \int_{k_1}^{k_2} h(a \mid a^*)\, da \tag{34.2.1}$$

then represents the conditional probability of the event k_1 < a < k_2
relative to a given value of a*. This probability is commonly known
as the a posteriori probability of the event k_1 < a < k_2, as distinct
from the a priori probability of the same event, which is equal to

$$\int_{k_1}^{k_2} \varpi(a)\, da.$$
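The a posteriori probability (34.2.1) can be evaluated numerically once the a priori fr. f. and g(a*; a) are specified. The sketch below assumes, purely for illustration, a normal a priori distribution (0, 2) and an estimate a* that is normal (a, 0.5); these choices, the integration range and the function names are ours and do not come from the text.

```python
from statistics import NormalDist

def posterior_probability(prior_pdf, g, a_star, k1, k2, lo, hi, steps=20000):
    """Approximate (34.2.1) by the midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    numerator = denominator = 0.0
    for i in range(steps):
        a = lo + (i + 0.5) * h
        weight = prior_pdf(a) * g(a_star, a)   # a priori density times g(a*; a)
        denominator += weight
        if k1 < a < k2:
            numerator += weight
    return numerator / denominator

# Hypothetical setting: normal a priori distribution (0, 2); a* normal (a, 0.5).
prior = NormalDist(0.0, 2.0)

def g(a_star, a):
    return NormalDist(a, 0.5).pdf(a_star)

if __name__ == "__main__":
    print(posterior_probability(prior.pdf, g, a_star=1.0, k1=0.5, k2=1.5,
                                lo=-10.0, hi=10.0))
```

In this particular setting the a posteriori distribution is again normal, so the numerical value can be compared with the closed form.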

By 14.3 and 21.4, the a posteriori probability (34.2.1) admits a
frequency interpretation which runs as follows. Consider a sequence
of a large number of independent trials, where each trial consists in
drawing a batch from the population of batches, and then drawing
a sample of n values from the batch (we use a terminology adapted
to the example considered above, but the argument is evidently general).
From the sample, we calculate the estimate a*, and we further assume
that it is possible to examine all the articles in the total batch, so
that the corresponding value of a may be directly determined. The
result of each trial will thus be a pair of observed values of the variables
a and a*. From the sequence of all trials, we now select the
sub-sequence formed by those cases where the observed value of a*
belongs to some small neighbourhood of a value a_0* given in advance.
The frequency ratio of the event k_1 < a < k_2 in this sub-sequence will
then, within the limits of random fluctuations, be given by the value
of the a posteriori probability (34.2.1) for a* = a_0*.

The above is the direct frequency interpretation of the a posteriori probability.
By a slight modification of the argument, we may obtain a result which shows a
greater formal resemblance to the theory of confidence intervals as given below. Let
e be given such that 0 < e < 1. To every given a* we can then determine the
limits k_1 = k_1(a*, e) and k_2 = k_2(a*, e) in (34.2.1) such that the probability
P(k_1 < a < k_2 | a*) takes the value 1 - e. (The reader may here consult Fig. 33,
replacing c_1 and c_2 by k_1 and k_2.) Consider now once more the above sequence
of all trials, and let us calculate the limits k_1 = k_1(a*, e) and k_2 = k_2(a*, e) from
the sample obtained in each trial. The interval (k_1, k_2) will then depend on a*, so
that in general the successive trials will yield different intervals. Let us in each
trial count the occurrence of the event k_1 < a < k_2 as a »success», and the occurrence
of the opposite event as a »failure». The probability of a success is then constantly
equal to 1 - e, and accordingly (cf 16.6) the frequency ratio of successes in a long
series of trials should, within the limits of random fluctuations, be equal to 1 - e.
The practical implications of this result, in a case where the method may be
legitimately applied, are similar to those discussed below.

2. The method of confidence intervals. In a case where there are
definite reasons to regard a as a random variable, with a known
probability distribution, the application of the preceding method is
perfectly legitimate, and leads to explicit probability statements about
the value of a corresponding to a given sample. However, in the
majority of cases occurring in practice, these conditions will not be
satisfied. As a rule a is simply an unknown constant, and there is
no evidence that the actual value of this constant has been determined
by some procedure resembling a random experiment. Often there will
even be evidence in the opposite direction, as e. g. in cases where the
a-values of various populations are subject to systematic variation in
time or space. Moreover, even when a may be legitimately regarded
as a random variable, we usually lack sufficient information about its
a priori distribution.

It would thus be highly desirable to be able to approach the question
without making any hypothesis about the random or non-random
nature of the parameter a. Certain methods designed to meet this
desideratum have been developed by the authors quoted in the preceding
paragraph, and we now proceed to show how the problem may
be treated by the method of confidence intervals due to Neyman (l. c.,
cf also Wilks, Ref. 42, 234). In the present paragraph, we shall consider
the question under certain simplifying assumptions, while more
general cases will be dealt with in the next paragraph.

We shall now consider a as a variable in the ordinary analytic
sense, which assumes a constant, though unknown value in the population
from which an actual sample has been drawn. The results thus
obtained will hold true whether the value of a has been determined
by a random experiment or not, so that this method is actually of
more general applicability than the preceding one.

As before, we consider a sample of n values from a distribution
with the fr. f. f(x; a), and we denote by g(a*; a) the fr. f. of the
estimate a* = a*(x_1, ..., x_n). Denote further by P(S; a) the joint pr. f.
of the sample variables x_1, ..., x_n, and let e be given such that 0 < e < 1.

For every fixed a, the fr. f. g(a*; a) defines the probability distribution
of a*, which may be interpreted as a distribution of a unit of
mass on the vertical through the point (a, 0) in the (a, a*)-plane
(cf Fig. 33). Suppose now that, for every value of a, two quantities
y_1 = y_1(a, e) and y_2 = y_2(a, e) have been determined such that
the quantity of mass belonging to the interval y_1 < a* < y_2 of the
corresponding vertical (i. e. the probability of the event y_1 < a* < y_2
for the value a of the parameter) becomes

$$P(y_1 < a^* < y_2;\ a) = \int_{y_1}^{y_2} g(a^*; a)\, da^* = 1 - e. \tag{34.2.2}$$

Obviously this can always be done, and there are even an infinity of
possible ways of choosing y_1 and y_2, since these quantities may be
determined from the relations

$$\int_{-\infty}^{y_1} g\, da^* = e_1 \qquad \text{and} \qquad \int_{y_2}^{\infty} g\, da^* = e_2,$$

where e_1 and e_2 are any positive numbers such that e_1 + e_2 = e.
If we draw a sample of n values from a distribution corresponding
to any value of a, the event y_1 < a* < y_2 will thus always have a
probability equal to 1 - e. The quantities y_1 and y_2 depend on a, and
when a varies, the points (a, y_1) and (a, y_2) will describe two curves
in the plane of (a, a*), as indicated in Fig. 33. We shall assume that
each curve is cut in one single point by a parallel to the axis of a.
Let the abscissae of the two points where the curves are cut by the
horizontal through the point (0, a*) be c_1 = c_1(a*, e) and c_2 = c_2(a*, e),
and let D(e) denote the domain situated between the curves. Consider
the three relations

$$(a, a^*) \subset D(e), \qquad y_1(a, e) < a^* < y_2(a, e), \qquad c_1(a^*, e) < a < c_2(a^*, e). \tag{34.2.3}$$

For any fixed value of a, each of these relations is satisfied by a
certain set of points x = (x_1, ..., x_n) in the sample space. However,
the three relations are perfectly equivalent, since all three express
the fact that the point (a, a*) belongs to the domain D(e). Thus the
three sets in the sample space are identical, and consequently we obtain
from (34.2.2) for every value of a

$$P(c_1 < a < c_2;\ a) = 1 - e. \tag{34.2.4}$$


Both relations (34.2.2) and (34.2.4) give the value of the set function
P(S; a) for a certain set S in the sample space, which is defined in
two different but equivalent ways, viz. by the two last relations (34.2.3).
The first of these asserts that the random variable a* takes a value
between the constant limits y_1 and y_2. The last relation (34.2.3), on
the other hand, asserts that the random variable c_1(a*, e) takes a value
smaller than a, while the random variable c_2(a*, e) takes a value
greater than a or, in other words, that the variable interval (c_1, c_2)
covers the fixed point a. According to (34.2.4), the probability of this
event is equal to 1 - e, whatever the value of a.

Consider now a sequence of independent trials, where each trial
consists in drawing a sample of n values from a population with the
fr. f. f(x; a), the values of a corresponding to the successive trials
being at liberty kept constant or allowed to vary in a perfectly arbitrary
way, random or non-random. From each set of sample values,
we calculate the quantities c_1 = c_1(a*, e) and c_2 = c_2(a*, e), using the
value of e given in advance. In general, c_1 and c_2 will have different
values in different trials. Each trial will be counted as a »success»,
if the corresponding interval (c_1, c_2) covers the corresponding point a,
and otherwise as a »failure». By (34.2.4), the probability of a success
is then constantly equal to 1 - e, and accordingly (cf 16.6) the frequency
ratio of successes in a long sequence of trials will, within the
limits of random fluctuations, be equal to 1 - e.
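The frequency statement can be tried out on a case where the limits are explicit. The sketch below assumes, purely for illustration, a population uniformly distributed over (0, a) and takes a* as the largest sample value; neither choice comes from the text. Then P(a* < y) = (y/a)^n, so that y_1 = a e^(1/n) and y_2 = a satisfy (34.2.2), and cutting the two curves by a horizontal gives c_1 = a* and c_2 = a*/e^(1/n).

```python
import random

def confidence_limits(sample, e):
    """Limits c_1, c_2 for the upper end a of a uniform (0, a) population."""
    n = len(sample)
    a_star = max(sample)                      # the estimate a*
    return a_star, a_star / e ** (1.0 / n)

def success_frequency(trials, n, e, seed=1):
    """Frequency ratio of trials in which (c_1, c_2) covers the actual a."""
    rng = random.Random(seed)
    successes = 0
    for i in range(trials):
        a = 1.0 + (i % 7)                     # a varies non-randomly from trial to trial
        sample = [rng.uniform(0.0, a) for _ in range(n)]
        c1, c2 = confidence_limits(sample, e)
        successes += c1 < a < c2
    return successes / trials

if __name__ == "__main__":
    print(success_frequency(trials=20000, n=5, e=0.05))
```

The frequency ratio should lie close to 1 - e although a changes from trial to trial.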

Suppose now that we apply constantly the following rule of behaviour.
We first choose once for all some small value of e, say e = p/100. Whenever
a sample has been drawn, and the corresponding limits c_1 and c_2
have been calculated, we further state that the unknown value of a in
the corresponding population is situated between c_1 and c_2. According to
the above, we shall then always have the probability e = p/100 of giving
a wrong statement. In the long run, our statements will thus be wrong
in about p % of all cases, and otherwise correct.

The interval (c_1, c_2) will be called a confidence interval for the parameter
a, corresponding to the confidence coefficient 1 - e, or the confidence
level e = p/100. The quantities c_1 and c_2 are the corresponding
confidence limits.

Comparing this mode of treatment with the one based on Bayes'
theorem, it will be seen that the method of confidence intervals is
entirely free from any hypothesis with respect to the random or non-random
nature of a. On the other hand, it follows from this very
generality that the method does not lead to probability statements of
the type: »The probability that a is situated between such and such
fixed limits is equal to 1 - e». In fact, such a statement has no sense
except when a is a random variable. The statements provided by the
method of confidence intervals are of the type of the relation (34.2.4),
which expressed in words becomes: »The probability that such and
such limits (which may vary from sample to sample) include between
them the parameter value a corresponding to the actual sample, is
equal to 1 - e». As shown above, we may deduce from this statement
a rule of behaviour associated with a constant risk of error e, where e
may be arbitrarily fixed.

It must be observed that the system of confidence intervals corresponding
to a given e is not unique. Just as we may consider various
different estimates of the same parameter a, we may also have various
systems of confidence intervals, leading to different rules of behaviour,
all associated with the same risk of error e. This is by no means
contradictory. As we have seen above, the confidence intervals obtained
by applying a given rule will vary from sample to sample, and it is
perfectly natural that, for a given sample, different rules may yield
different intervals (cf Ex. 1 below).

Obviously it will be in our interest to find rules which, under given
circumstances, yield as short confidence intervals as possible. Suppose
e. g. that we are dealing with estimates a* which are unbiased and
approximately normally distributed. The strip D(e) in Fig. 33 will
then be made as narrow as possible by choosing for a* an estimate
of minimum variance. Thus the classes of efficient and asymptotically
efficient estimates studied in Ch. 32 will, under fairly general conditions,
lead to the shortest or asymptotically shortest confidence intervals.
We cannot go further into this subject here, but the reader is
referred to papers by Neyman (Ref. 165) and Wilks (Ref. 233).

We finally observe that the above definitions and arguments apply
even in the case of a discrete distribution involving a single unknown
parameter a. However, there is one important modification to be made
in this case. When the distribution on the vertical through the point
(a, 0) in Fig. 33 has discrete mass points, the limits y_1 and y_2 cannot
always be determined such that P(y_1 < a* < y_2; a) = 1 - e as required
by (34.2.2). We shall have to be satisfied with choosing y_1 and y_2
such that P(y_1 < a* < y_2; a) ≥ 1 - e, which is evidently always possible.
The strip D(e) and the confidence interval (c_1, c_2) are then
determined as in the continuous case. The risk of committing an
error when stating that a belongs to (c_1, c_2) is in this case not exactly
equal to e, but at most equal to e. With this exception, everything is
perfectly similar to the continuous case.

Ex. 1. Let it be required to estimate the mean m of a normal population with
a known s. d. σ. Replacing in Fig. 33 a and a* by m and m*, first consider the
efficient estimate m* = x̄ = Σ x_i/n, which is normal (m, σ/√n). For the confidence
level e = p/100, the limits y_1 and y_2 in Fig. 33 may be put equal to m ± λ_p σ/√n,
where λ_p is the p % value of a normal deviate. The curves forming the boundary
of the domain D(e) will then be the straight lines x̄ = m ± λ_p σ/√n. The relations

$$m - \lambda_p \sigma/\sqrt{n} < \bar{x} < m + \lambda_p \sigma/\sqrt{n},$$

$$\bar{x} - \lambda_p \sigma/\sqrt{n} < m < \bar{x} + \lambda_p \sigma/\sqrt{n}$$

are evidently equivalent, so that the limits c_1 and c_2 are equal to x̄ ± λ_p σ/√n. The
rule which consists in asserting, whenever a sample has been drawn, that the unknown
mean m is situated between the limits x̄ ± λ_p σ/√n is thus associated with a
constant risk of error equal to p %.

We have, in fact, already encountered this interval in 31.3, Ex. 2. We have
seen there that, working on a p % level of significance, the hypothesis that the mean
of the distribution has a value c given in advance will be regarded as consistent
with the data when c is situated between the confidence limits x̄ ± λ_p σ/√n, while
otherwise it will be rejected.

Suppose, on the other hand, that we consider the non-efficient estimate m* = z,
where z is the sample median. By 28.6, z is asymptotically normal (m, kσ/√n), where
k = √(π/2) = 1.2533. Let us, for the sake of the argument, assume that the error of
approximation can be neglected, so that the distribution is exactly normal. Each of
the equivalent relations

$$m - k\lambda_p \sigma/\sqrt{n} < z < m + k\lambda_p \sigma/\sqrt{n} \qquad \text{and} \qquad z - k\lambda_p \sigma/\sqrt{n} < m < z + k\lambda_p \sigma/\sqrt{n}$$

then has a probability of (100 - p) %, and consequently we obtain in this case the
p % confidence limits z ± kλ_p σ/√n. From a given sample, we thus obtain different
confidence intervals for m, according as we apply the rule founded on x̄ or on z.
Nevertheless the risk of error is the same in both cases, if we are using the same
value of e. Obviously the former rule will always give a shorter interval than the
latter.
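The two rules of Ex. 1 may be compared on a concrete sample. The sketch below uses Python's `statistics.NormalDist` for the p % value of a normal deviate; the function names are ours, and the rule based on the median rests on the same assumption as in the text, namely that its asymptotic normal distribution may be treated as exact.

```python
import math
from statistics import NormalDist, mean, median

def normal_deviate_value(p):
    """lambda_p: the p % value of a normal deviate, P(|xi| > lambda_p) = p/100."""
    return NormalDist().inv_cdf(1.0 - p / 200.0)

def mean_interval(sample, sigma, p):
    """Limits x_bar -/+ lambda_p * sigma / sqrt(n)."""
    half = normal_deviate_value(p) * sigma / math.sqrt(len(sample))
    centre = mean(sample)
    return centre - half, centre + half

def median_interval(sample, sigma, p):
    """Limits z -/+ k * lambda_p * sigma / sqrt(n), with k = sqrt(pi/2)."""
    k = math.sqrt(math.pi / 2.0)
    half = k * normal_deviate_value(p) * sigma / math.sqrt(len(sample))
    z = median(sample)
    return z - half, z + half

if __name__ == "__main__":
    sample = [4.1, 5.3, 4.8, 6.0, 5.1, 4.4, 5.7]   # hypothetical observations
    print(mean_interval(sample, sigma=1.0, p=5))
    print(median_interval(sample, sigma=1.0, p=5))
```

For any sample the second interval is longer than the first by the constant factor k.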

Ex. 2. Suppose that we have made n repetitions of a random experiment, and
that a certain event E has occurred v times. It is required to estimate the unknown
probability p of E. This is the classical problem of inverse probability, which is
treated in the majority of text-books by means of Bayes' theorem.

We shall here apply the theory of confidence intervals to the problem, and consider
the efficient estimate (cf 32.3, Ex. 6) p* = v/n, which is asymptotically normal
(p, √(pq/n)), where q = 1 - p. Taking the limits y_1 and y_2 equal to p ± λ√(pq/n) and
assuming, as in the preceding example, that the distribution is exactly normal, Fig.
33 will take the form indicated in Fig. 34. The domain D(e) is here bounded by
the curves p* = p ± λ√(pq/n), which form the two halves of an ellipse, λ being the
100 e % value of a normal deviate. The fact that a point (p, p*) is situated inside
the ellipse may be expressed by saying that p* lies between the limits p ± λ√(pq/n),
or by the equivalent statement that p lies between the limits

$$\frac{n}{n + \lambda^2}\left[\,p^* + \frac{\lambda^2}{2n} \pm \lambda\sqrt{\frac{p^* q^*}{n} + \frac{\lambda^2}{4 n^2}}\,\right], \qquad q^* = 1 - p^*. \tag{34.2.5}$$

The latter limits determine a 100 e % confidence interval for p.

Fig. 34. Confidence intervals for an unknown probability. n = 100, e = 0.05.

This result is, of course, only approximate, since in reality p* has a discrete
distribution which is only approximately normal. E. S. Pearson and Clopper (Ref.
196) have given graphs based on the exact distribution and permitting a determination
of confidence intervals for the 5 % and 1 % levels. As Pearson and Clopper
point out, their graphs may be used i. a. to determine the value of n which is
necessary to provide a desired degree of accuracy in the estimation of p. Suppose,
e. g., that p is about 50 %, and that we want a confidence interval of length at most
equal to δ. From the approximate solution (34.2.5) we obtain, taking p* = 1/2,

$$\frac{\lambda}{\sqrt{n + \lambda^2}} \le \delta, \qquad \text{or} \qquad n \ge \lambda^2\,\frac{1 - \delta^2}{\delta^2}.$$

Taking e. g. δ = e = 0.01, this gives n > 66 340.
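The approximate limits for p and the required sample size can be computed directly. The following sketch obtains the limits by solving (p* - p)² = λ² p(1 - p)/n for p, which is the equation of the ellipse described above; the function names are ours, and the normal approximation is assumed to be exact, as in the text.

```python
import math
from statistics import NormalDist

def normal_value(e):
    """lambda: the 100e % value of a normal deviate, P(|xi| > lambda) = e."""
    return NormalDist().inv_cdf(1.0 - e / 2.0)

def probability_limits(v, n, e):
    """Approximate confidence limits for p from v occurrences in n repetitions."""
    lam = normal_value(e)
    p_star = v / n
    q_star = 1.0 - p_star
    factor = n / (n + lam ** 2)
    centre = p_star + lam ** 2 / (2.0 * n)
    half = lam * math.sqrt(p_star * q_star / n + lam ** 2 / (4.0 * n ** 2))
    return factor * (centre - half), factor * (centre + half)

def required_n(delta, e):
    """Least n (as a real number) giving length <= delta when p* = 1/2."""
    lam = normal_value(e)
    return lam ** 2 * (1.0 - delta ** 2) / delta ** 2

if __name__ == "__main__":
    print(probability_limits(50, 100, 0.05))
    print(required_n(0.01, 0.01))
```

With δ = e = 0.01 the second function returns a value slightly above 66 340, in agreement with the figure quoted in the example.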

Ex. 3. Suppose that we have a population consisting of a finite number N of
individuals, Np of which possess a certain attribute A, while the remaining
Nq = N - Np do not possess A. It is now required to estimate the unknown proportion
p by the representative method (cf 25.7). Let us draw a random sample of
n individuals without replacement, and observe the number v of individuals in the
sample possessing the attribute A. In current text-books on probability, it is shown
that we have (cf e. g. Cramér, Ref. 10, p. 38)

$$E\left(\frac{v}{n}\right) = p, \qquad D^2\left(\frac{v}{n}\right) = \frac{N - n}{N - 1}\,\frac{p q}{n}.$$

Further, the variable p* = v/n is approximately normally distributed, when n and
N - n are large. Taking p* as an estimate of p, we now assume as above that the
error of approximation involved in the normal distribution can be neglected. The
probability that p* lies between the limits p ± λ √((N - n)/(N - 1) · pq/n) is then
equal to 1 - e, where λ has the same significance as in the preceding example. Thus
we obtain confidence limits for the unknown proportion p simply by substituting
in (34.2.5)

$$\frac{N - 1}{N - n}\, n \quad \text{for } n.$$


34.3. The general case. — The theory of confidence intervals 
developed in the preceding paragraph is easily extended to more 
general cases. Consider a distribution of the continuous type containing
k unknown parameters a_1, ..., a_k, and suppose that we draw
a sample of n values from this distribution.

The sample variables will as usual be regarded as the coordinates
of a point x = (x_1, ..., x_n) in the n-dimensional sample space R_n, and
similarly the set of parameters of an actual distribution will be represented
by the point a = (a_1, ..., a_k) in a k-dimensional parametric
space P_k. For simplicity we suppose that the distribution is defined
for all points a of P_k, and we denote the joint pr. f. of the variables
x_1, ..., x_n by P(S; a), where S is a set in the sample space R_n.

For the following developments, it is not necessary to suppose that
the variables x_1, ..., x_n are independent variables all having the same
distribution. With a similar generalization as in 32.8 we may, in
fact, allow P(S; a) to denote any n-dimensional pr. f. of the continuous
type, which is defined for all parametric points a = (a_1, ..., a_k).

To every parametric point a in P_k we may determine a set S(a)
of points x in R_n such that

$$P[x \subset S(a);\ a] = 1 - e, \tag{34.3.1}$$

where e is given in advance. The set S(a) corresponds to the interval
y_1 < a* < y_2 in Fig. 33, 1) and the relation (34.3.1) corresponds
to (34.2.2). Further, the set D of all points (a, x) in the product
space P_k · R_n such that the relation x ⊂ S(a) is satisfied, corresponds
to the domain D(e) in Fig. 33. For every point x in R_n, we now
consider the set Σ(x) of all points a in P_k such that (a, x) ⊂ D.
Then Σ(x) corresponds to the interval c_1 < a < c_2 of Fig. 33, and the
three relations

$$(a, x) \subset D, \qquad x \subset S(a), \qquad a \subset \Sigma(x),$$

are equivalent, for the same reasons as the corresponding relations
(34.2.3). Hence we obtain the analogue of (34.2.4):

$$P[a \subset \Sigma(x);\ a] = 1 - e. \tag{34.3.2}$$

1) We may here regard Fig. 33 as concerned with a sample of one single observed
value a* from a distribution with the fr. f. g(a*; a).

The further development is exactly similar to the preceding particular
case. If we draw repeatedly samples of n from distributions of the
given type, the corresponding parametric points a being at liberty
kept constant or allowed to vary in a perfectly arbitrary way, and if
for every sample we state that the actual parametric point a belongs
to the set Σ(x) corresponding to the sample, we shall in each case
have the probability e = p/100 of being wrong. Consequently in the
long run our statements will be wrong in about p % of all cases.

The set Σ(x) will be called a confidence region for the parametric
point a, corresponding to the confidence coefficient 1 - e, or the confidence
level e = p/100. If, in particular, the set Σ(x) is an interval
in P_k defined by one single relation of the form

$$c_1(x, e) < a_r < c_2(x, e), \tag{34.3.3}$$

where r is one of the subscripts 1, ..., k, while c_1 and c_2 are independent
of a_1, ..., a_k, we shall call Σ(x) a confidence interval for the
parameter a_r. The last definition evidently includes the corresponding
definition of the preceding paragraph as a particular case. More
generally, if the set Σ(x) is a cylinder set (cf 3.5), the base of which
is a set in the subspace of the parameters a_1, ..., a_r, where r < k, we
shall say that Σ(x) is a confidence region for the parameters a_1, ..., a_r.
With respect to the generalization to distributions containing discrete
mass points, the remarks of the preceding paragraph apply even
in the present general case. Finally, the generalization to samples
from multi-dimensional distributions is immediate.

34.4. Examples. — In 31.3, Ex. 6, we have already encountered
some confidence intervals for coefficients of regression and correlation 
in the case of samples from a two-dimensional normal distribution. 
We shall now discuss some further examples, which will give rise to 
comments on certain points of general interest. 



Ex. 1. The mean of a normal distribution. When x_1, ..., x_n are a
set of sample values from a normal distribution with unknown parameters
m and σ, the ratio (cf 29.4)

$$t = \sqrt{n - 1}\,\frac{\bar{x} - m}{s}$$

has Student's distribution with n - 1 d. of fr., the corresponding fr. f.
being s_{n-1}(t). For any interval (t', t''), the relation

$$t' < \sqrt{n - 1}\,\frac{\bar{x} - m}{s} < t'' \tag{34.4.1}$$

has thus the probability $\int_{t'}^{t''} s_{n-1}(t)\, dt$, which is independent of the
parameters m and σ, and by an appropriate choice of t' and t'' this
can be made to assume any given value 1 - e.

Suppose that t' and t'' are fixed. For every parametric point (m, σ),
the relation (34.4.1) then defines a set of points x in the sample space
which corresponds to the set S(a) of the preceding paragraph. However,
(34.4.1) may also be written in the equivalent form

$$\bar{x} - t''\,\frac{s}{\sqrt{n - 1}} < m < \bar{x} - t'\,\frac{s}{\sqrt{n - 1}}. \tag{34.4.2}$$

For any fixed point x = (x_1, ..., x_n) in the sample space, this relation
defines an interval in the parametric space, which is independent of σ,
and is thus of the form (34.3.3), where a_r has been replaced by m.
According to the definition of the preceding paragraph, (34.4.2) thus
provides a confidence interval for the mean m, and we have the following
relation corresponding to (34.3.2):

$$P\left(\bar{x} - t''\,\frac{s}{\sqrt{n - 1}} < m < \bar{x} - t'\,\frac{s}{\sqrt{n - 1}};\ m, \sigma\right) = \int_{t'}^{t''} s_{n-1}(t)\, dt. \tag{34.4.3}$$

Thus if we draw repeatedly samples of n from normal populations,
the values of m and σ corresponding to the successive samples being
at liberty kept constant or allowed to vary in an arbitrary way, and
if for every sample we calculate the confidence limits x̄ - t'' s/√(n - 1)
and x̄ - t' s/√(n - 1), the frequency of those cases where m is included
between the limits will in the long run be approximately equal to

$$\int_{t'}^{t''} s_{n-1}(t)\, dt.$$


Every choice of t' and t'' yields, according to (34.4.3), a rule for
calculating confidence intervals for m, the corresponding confidence
coefficient being $\int_{t'}^{t''} s_{n-1}(t)\, dt$. Taking e. g. t' = -t_p and t'' = t_p, where
t_p is the p % value of t for n - 1 d. of fr., we obtain the confidence
limits

$$\bar{x} \pm t_p\,\frac{s}{\sqrt{n - 1}},$$

corresponding to the confidence coefficient 1 - p/100, or the confidence
level p %.


Consider the sample of n = 10 values from a supposedly normal population contained
in the last column of Table 31.3.7. The mean and the s. d. of the sample
are respectively 1.58 and 1.167. Hence we obtain according to the last rule the
confidence limits 1.58 ± 0.389 t_p for the unknown population mean m. For the
confidence level p = 5 %, this gives the confidence interval 0.70 < m < 2.46, while
for p = 1 % the interval becomes 0.32 < m < 2.84.
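The limits x̄ ± t_p s/√(n - 1) can be computed without tables. The sketch below is one possible way of doing so: it integrates the Student fr. f. by Simpson's rule and finds t_p by bisection. These numerical methods and all function names are our own choices, not part of the text.

```python
import math

def student_density(t, nu):
    """Fr. f. s_nu(t) of Student's distribution with nu d. of fr."""
    c = math.gamma((nu + 1) / 2.0) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2.0))
    return c * (1.0 + t * t / nu) ** (-(nu + 1) / 2.0)

def student_integral(lower, upper, nu, steps=2000):
    """Integral of s_nu(t) from lower to upper by Simpson's rule (steps even)."""
    h = (upper - lower) / steps
    total = student_density(lower, nu) + student_density(upper, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * student_density(lower + i * h, nu)
    return total * h / 3.0

def t_value(p, nu):
    """t_p, defined by P(|t| > t_p) = p/100, found by bisection on (0, 50)."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if student_integral(-mid, mid, nu) < 1.0 - p / 100.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def mean_limits(x_bar, s, n, p):
    """Confidence limits x_bar -/+ t_p * s / sqrt(n - 1) at the level p %."""
    half = t_value(p, n - 1) * s / math.sqrt(n - 1)
    return x_bar - half, x_bar + half

if __name__ == "__main__":
    for p in (5, 1):
        print(p, mean_limits(1.58, 1.167, 10, p))
```

Applied to the sample mean 1.58 and s. d. 1.167 with n = 10, this can be compared with the two intervals stated above.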

Choosing t' and t'' differently, we obtain other rules for calculating
confidence intervals for m. Suppose, e. g., that an interval (a, b) is
given in advance. We now draw a sample of n values from a normal
population, and denote the observed sample point by x_0, the sample
mean by x̄_0, and the s. d. by s_0. From these particular values x̄_0 and
s_0, we further determine t' and t'' such that

$$\bar{x}_0 - t''\,\frac{s_0}{\sqrt{n - 1}} = a, \qquad \bar{x}_0 - t'\,\frac{s_0}{\sqrt{n - 1}} = b.$$

Like any other values of t' and t'', the values determined in this way
correspond to a rule for calculating confidence intervals for m, and
in the particular case of the sample x_0, this rule leads precisely to the
given interval (a, b). Solving the above equations for t' and t'', we
find that the corresponding confidence coefficient is

$$\int_{(\bar{x}_0 - b)\sqrt{n - 1}/s_0}^{(\bar{x}_0 - a)\sqrt{n - 1}/s_0} s_{n-1}(t)\, dt. \tag{34.4.4}$$

When the sample $x_0$ is known, this quantity can be numerically calculated
for any interval $(a, b)$. Thus we may say that, with respect to
the estimation of $m$ by means of the sample characteristics $\bar{x}$ and $s$, the
observed sample $x_0$ assigns to any given interval $(a, b)$ a confidence
coefficient given by (34.4.4). *)

However, it is necessary to note carefully the concrete meaning of
the last proposition. We are not saying that there is a probability
given by (34.4.4) that $m$ falls between the given limits $a$ and $b$. As
already pointed out in 34.2, such a statement would have no sense
except when $m$ is a random variable. We do, in fact, only assert that
there exists a rule for calculating confidence intervals for $m$, which in the
particular case of the sample $x_0$ would lead to the given interval $(a, b)$
as confidence interval, and that this rule is associated with the confidence
coefficient (34.4.4).

In the case of the sample of $n = 10$ values from Table 31.3.7 considered above,
we thus find by means of Table 4 that the interval $0.5 < m < 2.6$ has the confidence
coefficient $\int s_9(t)\,dt = 0.97$, the integral being taken between the limits given in (34.4.4).
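
The confidence coefficient (34.4.4) of an arbitrary interval can also be obtained by numerical integration instead of a table. The following sketch assumes the sample values $\bar{x}_0 = 1.58$, $s_0 = 1.167$, $n = 10$, and integrates the fr. f. of Student's distribution by Simpson's rule; the function names and the number of integration steps are choices made here for illustration only.

```python
import math

def student_density(t, dof):
    """Fr. f. of Student's distribution with `dof` degrees of freedom."""
    const = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return const * (1 + t * t / dof) ** (-(dof + 1) / 2)

def confidence_coefficient(a, b, mean, sd, n, steps=2000):
    """Integral (34.4.4) by Simpson's rule; `steps` must be even."""
    root = math.sqrt(n - 1)
    lower = root * (mean - b) / sd      # lower limit of integration
    upper = root * (mean - a) / sd      # upper limit of integration
    h = (upper - lower) / steps
    total = student_density(lower, n - 1) + student_density(upper, n - 1)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * student_density(lower + i * h, n - 1)
    return total * h / 3

coef = confidence_coefficient(0.5, 2.6, mean=1.58, sd=1.167, n=10)
print(round(coef, 3))
```

Under these assumptions the result should come out close to the value 0.97 quoted in the text. It is a property of the rule leading to the interval, not a probability statement about $m$.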

As in 34.2, it should be observed that the above system of confidence
intervals and confidence coefficients for the estimation of $m$
is not unique. If, e. g., we replace $\bar{x}$ and $s$ by the median and the
mean deviation of the sample, we shall obtain a different system of
rules.


*) At this point, we possibly exceed the conceptual limits of the theory as given
by Neyman (cf Ref. 167). The same remark applies to the corresponding part of
Ex. 2.


Ex. 2. The difference between the means of two normal distributions.
Let $x_1, \ldots, x_{n_1}$ and $y_1, \ldots, y_{n_2}$ be two independent samples with the
means $\bar{x}$ and $\bar{y}$, and the s. d:s $s_1$ and $s_2$. Suppose that these are drawn
from normal populations with the means $m_1$ and $m_2$, and the s. d:s $\sigma_1$
and $\sigma_2$ respectively. We suppose that all four parameters are unknown,
and that it is required to estimate the difference $m_1 - m_2$ between
the population means. This problem has been much discussed in the
literature (cf e. g. Bartlett, Ref. 55, 56; Behrens, Ref. 60; Fisher, Ref.
105-109; Neyman, Ref. 167; Welch, Ref. 229).

In 31.2, we have considered the question whether $m_1 - m_2$ differs
significantly from zero, under the simplifying assumption that $\sigma_1$ and
$\sigma_2$ are equal. If, in the variable $u$ defined by (31.2.1), we replace
$\bar{x} - \bar{y}$ by $\bar{x} - \bar{y} - (m_1 - m_2)$, and if we assume that $\sigma_1 = \sigma_2$, the
resulting variable will have Student's distribution with $n_1 + n_2 - 2$ d.
of fr. Hence we obtain, in the same way as in the preceding example,
the confidence limits

(34.4.5)  $\bar{x} - \bar{y} \pm t_p \sqrt{\frac{(n_1 + n_2)(n_1 s_1^2 + n_2 s_2^2)}{n_1 n_2 (n_1 + n_2 - 2)}}$

for the unknown difference $m_1 - m_2$. Here we have to take $t_p$ with
$n_1 + n_2 - 2$ d. of fr. We now proceed to make some remarks on
the general case when $\sigma_1$ and $\sigma_2$ may have any values.
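
Before the general case is taken up, the limits (34.4.5) for equal population s. d:s may be sketched in code. The function name and the numerical data below are hypothetical; the value $t_p = 2.101$ for 18 d. of fr. and $p = 5$ is assumed to be taken from a table of Student's distribution.

```python
import math

def pooled_limits(mean_x, mean_y, s1, s2, n1, n2, t_p):
    """Limits (34.4.5) for m1 - m2 under the assumption sigma1 = sigma2.

    s1 and s2 are the sample s.d.s computed with divisors n1 and n2,
    and t_p is the tabulated value of t for n1 + n2 - 2 d. of fr.
    """
    scale = math.sqrt((n1 + n2) * (n1 * s1**2 + n2 * s2**2)
                      / (n1 * n2 * (n1 + n2 - 2)))
    diff = mean_x - mean_y
    return diff - t_p * scale, diff + t_p * scale

# Hypothetical data: two samples of 10 values each.
lo, hi = pooled_limits(2.0, 1.0, 1.5, 1.5, 10, 10, t_p=2.101)
print(f"{lo:.3f} < m1 - m2 < {hi:.3f}")
```

With these hypothetical figures the quantity under the root sign reduces to $20 \cdot 45/1800 = 0.5$.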

To any parametric point $(m_1, m_2, \sigma_1, \sigma_2)$ corresponds a joint distribution
of the $n_1 + n_2$ variables $x_i$ and $y_j$, which are represented in a
space $R$ of $n_1 + n_2$ dimensions. Let now four constants $k_1$, $k_2$, $c_1$ and
$c_2$ be given, subject to the only condition that $k_1 < k_2$. For any parametric
point, the relation

$k_1 < c_1\,\frac{\bar{x} - m_1}{s_1} + c_2\,\frac{\bar{y} - m_2}{s_2} < k_2$

defines a set $S$ of points $(x, y) = (x_1, \ldots, x_{n_1}, y_1, \ldots, y_{n_2})$ in the space
$R$. Since the random variables

$t = \sqrt{n_1 - 1}\,\frac{\bar{x} - m_1}{s_1}$ and $u = \sqrt{n_2 - 1}\,\frac{\bar{y} - m_2}{s_2}$

are independent and distributed in Student's distribution with $n_1 - 1$ and
$n_2 - 1$ d. of fr. respectively, the probability that a sample point $(x, y)$
belongs to the set $S$ is

(34.4.6)  $J = \iint s_{n_1 - 1}(t)\, s_{n_2 - 1}(u)\,dt\,du,$

where the integral is extended over the domain defined by the relation

$k_1 < \frac{c_1 t}{\sqrt{n_1 - 1}} + \frac{c_2 u}{\sqrt{n_2 - 1}} < k_2.$

The quantity $J$ is independent of the parameters, and the set $S$ corresponds
to the set $S(\alpha)$ of 34.3. The relation which defines the set
$S$ may be written in the equivalent form

(34.4.7)  $\frac{c_1 \bar{x}}{s_1} + \frac{c_2 \bar{y}}{s_2} - k_2 < \frac{c_1 m_1}{s_1} + \frac{c_2 m_2}{s_2} < \frac{c_1 \bar{x}}{s_1} + \frac{c_2 \bar{y}}{s_2} - k_1.$

For any fixed point $(x, y)$, this relation defines a cylinder set $\Sigma(x, y)$
in the four-dimensional $(m_1, m_2, \sigma_1, \sigma_2)$-space, the base of which is a
strip bounded by two parallel lines in the $(m_1, m_2)$-subspace. Thus according
to 34.3 the set $\Sigma(x, y)$ is a confidence region for $m_1$ and $m_2$,
with the confidence coefficient $J$ given by (34.4.6).



Every choice of the constants $k_i$, $c_1$ and $c_2$ yields, according to
(34.4.7), a confidence region for $m_1$ and $m_2$, the corresponding confidence
coefficient being given by (34.4.6). By an appropriate choice
of the constants, we may render the confidence coefficient equal to
any given value $1 - \varepsilon$.

As in the preceding example, we now suppose that an interval
$(a, b)$ is given in advance, and that two samples $x_0$ and $y_0$ have been
drawn. From the particular values $\bar{x}_0$, $\bar{y}_0$, $s_1^0$ and $s_2^0$ observed in these
samples, we determine $k_1$, $k_2$, $c_1$ and $c_2$ such that

$c_1 = s_1^0, \qquad c_2 = -s_2^0, \qquad k_1 = \bar{x}_0 - \bar{y}_0 - b, \qquad k_2 = \bar{x}_0 - \bar{y}_0 - a.$

Like any other values of the constants, the values obtained in this
way correspond to a rule for determining confidence regions for $m_1$
and $m_2$. Inserting these values of the constants in (34.4.7), we find
that in the particular case of the samples $x_0$ and $y_0$ this rule leads
to the region

$a < m_1 - m_2 < b,$

while the domain of integration in the expression (34.4.6) of the confidence
coefficient becomes

(34.4.8)  $\bar{x}_0 - \bar{y}_0 - b < \frac{s_1^0\, t}{\sqrt{n_1 - 1}} - \frac{s_2^0\, u}{\sqrt{n_2 - 1}} < \bar{x}_0 - \bar{y}_0 - a.$


Thus there exists a rule for determining confidence regions for $m_1$
and $m_2$, which in the particular case of the samples $x_0$ and $y_0$ would lead
to the region $a < m_1 - m_2 < b$, and this rule is associated with the confidence
coefficient $J$ given by (34.4.6), where the integral is extended over
the domain (34.4.8). In the sense explained by this statement, we may
say that the samples $x_0$ and $y_0$ assign the confidence coefficient $J$ to the
region $a < m_1 - m_2 < b$.

Hence we may deduce a test of significance due to Behrens and
Fisher (l. c.). Let two samples with the means $\bar{x}$ and $\bar{y}$, and the s. d:s
$s_1$ and $s_2$ be given, and let $\theta$ be an angle such that

$\frac{s_1}{\sqrt{n_1 - 1}} = r \sin\theta, \qquad \frac{s_2}{\sqrt{n_2 - 1}} = r \cos\theta,$

where

$r^2 = \frac{s_1^2}{n_1 - 1} + \frac{s_2^2}{n_2 - 1}.$

Consider the integral $J$ in (34.4.6), extended over the domain

$t \sin\theta - u \cos\theta > d,$

and determine $d$ such that $J = \delta$, where $\delta$ is a given number such
that $0 < \delta < 1$. For fixed $\delta$, the quantity $d$ will be a function of
$n_1$, $n_2$ and $\theta$, which may be numerically calculated when these quantities
are known. Now if $\bar{x} - \bar{y} > dr$, the region $m_1 \leq m_2$ will according
to the above have a confidence coefficient smaller than $\delta$. Similarly,
if $\bar{x} - \bar{y} < -dr$, the region $m_1 \geq m_2$ will have a confidence coefficient
smaller than $\delta$. If $\delta$ is sufficiently small, the means $\bar{x}$ and $\bar{y}$ are
accordingly regarded as significantly different, as soon as $|\bar{x} - \bar{y}| > dr$.
Tables for the application of this test are available (cf Sukhatme,
Ref. 223; Fisher-Yates, Ref. 262).
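
Where no table is at hand, the quantity $d$ may be approximated by simulation. The sketch below is not the method of the tables cited. It is a plain Monte Carlo estimate, under the assumption that $d$ is defined by $P(t \sin\theta - u \cos\theta > d) = \delta$ with $t$ and $u$ independent Student variables with $n_1 - 1$ and $n_2 - 1$ d. of fr. All function names, the seed and the number of draws are choices made here for illustration.

```python
import math
import random

def student_variate(rng, dof):
    """One draw from Student's distribution with `dof` d. of fr."""
    chi2 = rng.gammavariate(dof / 2, 2.0)      # chi-square with dof d. of fr.
    return rng.gauss(0.0, 1.0) / math.sqrt(chi2 / dof)

def behrens_fisher_d(n1, n2, theta, delta, draws=100_000, seed=1):
    """Monte Carlo estimate of d with P(t sin(theta) - u cos(theta) > d) = delta."""
    rng = random.Random(seed)
    values = sorted(
        student_variate(rng, n1 - 1) * math.sin(theta)
        - student_variate(rng, n2 - 1) * math.cos(theta)
        for _ in range(draws)
    )
    # Empirical quantile of order 1 - delta.
    return values[min(draws - 1, int((1 - delta) * draws))]

# Hypothetical sample characteristics (s.d.s computed with divisor n).
n1, n2, s1, s2 = 10, 15, 1.2, 2.0
a1, a2 = s1 / math.sqrt(n1 - 1), s2 / math.sqrt(n2 - 1)
r = math.hypot(a1, a2)
theta = math.atan2(a1, a2)                     # a1 = r sin(theta), a2 = r cos(theta)
d = behrens_fisher_d(n1, n2, theta, delta=0.025)
print(f"means regarded as different when |x - y| > {d * r:.3f}")
```

For $\theta = 90°$ the variable reduces to $t$ alone, so that $d$ should then agree, up to simulation error, with the tabulated value of Student's distribution for $n_1 - 1$ d. of fr.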


Ex. 3. The mean of a finite population (cf 34.2, Ex. 3). Suppose that we have
a population consisting of a large, but finite number $N$ of individuals, among which
a certain character $x$ is distributed. For the mean, the variance, and other
characteristics of $x$ in the total population, we use the ordinary notations $m$, $\sigma^2$, $\mu_\nu$ etc.
It is required to estimate the unknown mean $m$ of the population by means of the
representative method (cf 25.7). Let us draw a random sample of $n$ individuals
without replacement, and denote by $\bar{x} = \sum x_i/n$ and $s^2 = \sum (x_i - \bar{x})^2/n$ the mean and
the variance of the $n$ observed sample values of $x$. We then have (cf e. g. Neyman,
Ref. 160; Hagstroem, Ref. 121 a)

$E(\bar{x}) = m, \qquad D^2(\bar{x}) = \frac{N - n}{N - 1} \cdot \frac{\sigma^2}{n},$

$E(s^2) = \frac{N}{N - 1} \cdot \frac{n - 1}{n}\,\sigma^2,$

$D^2(s^2) = \frac{(n - 1) N (N - n)\,\sigma^4}{n^3 (N - 1)^2 (N - 2)(N - 3)} \left[ 2nN^2 - 6(n + 1)(N - 1) + (nN - N - n - 1)(N - 1)\gamma_2 \right],$

where $\gamma_2 = \mu_4/\sigma^4 - 3$ is the coefficient of excess (cf 15.8) of the population. When
$n$ and $N - n$ are both large, $\bar{x}$ is approximately normal, so that the variable
$\sqrt{\frac{(N - 1)n}{N - n}}\,(\bar{x} - m)$ is approximately normal $(0, \sigma)$. The formulae for the mean
and the variance of $s^2$ may be written

$E\left(\frac{q^2}{\sigma^2}\right) = \frac{N(n - 1)}{N - n}\left[1 + O\left(\frac{1}{N}\right)\right],$

$D^2\left(\frac{q^2}{\sigma^2}\right) = \frac{2N(n - 1)}{N - n}\left[1 + \frac{n - 1}{2n}\,\gamma_2 + O\left(\frac{1}{N}\right)\right],$

where $q^2 = Nns^2/(N - n)$. If we assume that the excess $\gamma_2$ of the population may
be neglected, it now follows by means of (18.1.6) that for large $N$ the variable $q^2/\sigma^2$
has approximately the same mean and the same variance as a $\chi^2$ distribution with
$N(n - 1)/(N - n)$ d. of fr. Although in this case the exact distribution of $s^2$ or $q^2$
is not known, we may as a first approximation assume that the variable

$\sqrt{\frac{(N - 1)n}{N - n}}\,(\bar{x} - m) \Big/ \sqrt{\frac{N - n}{N(n - 1)}\,q^2} = \sqrt{\frac{(N - 1)(n - 1)}{N - n}} \cdot \frac{\bar{x} - m}{s}$

has Student's distribution (18.2.4), with $n$ replaced by $N(n - 1)/(N - n)$. For the
unknown population mean $m$, we then obtain as in Ex. 1 the $p$ % confidence limits

$\bar{x} \pm t_p\,s\,\sqrt{\frac{N - n}{(N - 1)(n - 1)}}.$
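
For a very small population, the exact expression for the variance of $s^2$ under sampling without replacement can be checked by listing all possible samples. The sketch below does this with exact rational arithmetic. The population `[0, 0, 0, 0, 1]` is purely hypothetical, the function names are introduced here, and the closed form coded in `variance_of_s2_formula` is the expression $\frac{(n-1)N(N-n)\sigma^4}{n^3(N-1)^2(N-2)(N-3)}[2nN^2 - 6(n+1)(N-1) + (nN-N-n-1)(N-1)\gamma_2]$, which requires $N \geq 4$.

```python
from fractions import Fraction
from itertools import combinations

def population_moments(pop):
    """Return N, the variance sigma^2 and the excess gamma_2 of a finite population."""
    N = len(pop)
    m = Fraction(sum(pop), N)
    var = sum((Fraction(v) - m) ** 2 for v in pop) / N
    mu4 = sum((Fraction(v) - m) ** 4 for v in pop) / N
    return N, var, mu4 / var**2 - 3

def variance_of_s2_formula(pop, n):
    """Closed form for D^2(s^2), sampling n values without replacement (N >= 4)."""
    N, var, g2 = population_moments(pop)
    factor = Fraction((n - 1) * N * (N - n),
                      n**3 * (N - 1) ** 2 * (N - 2) * (N - 3))
    bracket = (2 * n * N**2 - 6 * (n + 1) * (N - 1)
               + (n * N - N - n - 1) * (N - 1) * g2)
    return factor * var**2 * bracket

def variance_of_s2_enumerated(pop, n):
    """The same variance, obtained by enumerating all samples of n individuals."""
    values = []
    for sample in combinations(pop, n):
        mean = Fraction(sum(sample), n)
        values.append(sum((Fraction(v) - mean) ** 2 for v in sample) / n)
    expect = sum(values) / len(values)
    return sum((v - expect) ** 2 for v in values) / len(values)

pop = [0, 0, 0, 0, 1]          # hypothetical population, N = 5
print(variance_of_s2_formula(pop, 2), variance_of_s2_enumerated(pop, 2))
```

Both numbers should coincide. Such a check covers only the particular population tried and is no substitute for the derivation in the references cited.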




Additional remark. The following important papers bearing on the subjects treated
in Chs 32-34 (particularly 32.3-4 and 33.3) have unfortunately been omitted from the
List of References:

Doob, J. L., Probability and Statistics, Trans. Amer. Math. Soc., 36 (1934), p. 759.

Doob, J. L., Statistical estimation, Trans. Amer. Math. Soc., 39 (1936), p. 410.

Fréchet, M., Sur l'extension de certaines évaluations statistiques au cas de petits
échantillons, Rev. Inst. Intern. de Statistique, 1943, p. 182.





Chapters 35—37. Tests of Significance, II. 


CHAPTER 35. 

General Theory of Testing Statistical Hypotheses. 1)

35.1. The choice of a test of significance. In the preliminary
survey of problems of statistical inference given in Ch. 26, the introduction
of a test of significance for a statistical hypothesis has been
described (cf 26.2 and 26.4) in the following general terms: When it
is required to test whether a set of sample values agree with a given
hypothesis $H$, we consider the distribution of the sample, and calculate
some convenient measure $D \geq 0$ of the deviation of this distribution
from the hypothetical distribution. By means of the sampling distribution
of $D$, we then determine a critical value $D_0$ such that, if the
hypothesis $H$ is true, we have $P(D > D_0) = \varepsilon$, where $\varepsilon$ is our level of
significance chosen in advance. When, in an actual case, we find a
deviation $D > D_0$, the hypothesis $H$ is rejected, whereas the appearance
of a value $D \leq D_0$ is regarded as consistent with the hypothesis,
which is then accepted.

By adopting this rule of behaviour, we have a probability equal to
$\varepsilon$ of committing the error of rejecting $H$ in a case when, in fact, it
is true. Since $\varepsilon$ may be arbitrarily chosen, this probability may be
reduced to any desired amount.

The general principle thus described, which lies behind all the particular
tests discussed in Chs 30-31, has certainly a strong appeal
to intuition. On the given hypothesis, the occurrence of a very large
deviation $D$ has a very small probability. If, in an actual case, such
a deviation presents itself, we feel naturally inclined to consider the
hypothesis as disproved by experience. The appearance of some deviation
$D$ of moderate size, on the other hand, seems to be exactly the
kind of event that ought to be expected, if the hypothesis is true.

1) Cf footnote p. 473.

However, let us examine the principle a little more closely. Assume,
e. g., that $D$ has a continuous distribution, with a frequency curve of
a type similar to the $\chi^2$ distribution for $n > 2$ (cf Fig. 19, p. 235).
It is true that, on the hypothesis $H$, the probability of a large deviation,
say $D > D_0$, is small. In fact, this probability is equal to the
area of the tail of the frequency curve situated to the right of an
ordinate through the point $D_0$, and we can always determine $D_0$ such
that this becomes equal to any given $\varepsilon > 0$. But it is equally true
that the appearance of a very small deviation, say $D < D_1$, also has a
small probability, since we can determine $D_1$ such that the area of the
tail to the left of an ordinate through $D_1$ is equal to $\varepsilon$. If we agree
to reject $H$ whenever $D < D_1$, and otherwise accept, we shall thus
still have the same probability $\varepsilon$ of rejecting $H$ in a case when it is
true. More generally, we may in infinitely many ways choose a set
of points $S$ such that, if $H$ is true, the probability that $D$ takes a
value in $S$ is $P(D \subset S) = \varepsilon$. Consider, for any such $S$, the test which
consists in rejecting $H$ whenever $D$ takes a value in $S$, and otherwise
accepting. The probability of unjustly rejecting $H$ in a case when it
is true will then always be equal to $\varepsilon$, so that from this point of view
the tests based on various possible sets $S$ will all be equivalent. It is
likely that our intuitive feeling will be in favour of the $D_0$ test, where
the set $S$ is the interval of large deviations $D > D_0$, and definitely
opposed to the $D_1$ test, where we reject the hypothesis precisely in
those cases where the deviations are small. However, can we advance
any rational arguments in support of the view that some particular
form of the set $S$ should be preferred to other possible forms?
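
That the $D_0$ test and the $D_1$ test have the same level can be seen in a small numerical sketch. As an assumption made only for illustration, $D$ is taken to have a $\chi^2$ distribution with 4 d. of fr., the distribution function of which has the closed form $1 - e^{-x/2}(1 + x/2)$. The two critical values are found by bisection, and the helper names are introduced here.

```python
import math

def chi2_cdf_4(x):
    """Distribution function of chi-square with 4 d. of fr. (closed form)."""
    return 1.0 - math.exp(-x / 2) * (1 + x / 2)

def solve(target, lo=0.0, hi=100.0):
    """Bisection for the point x with chi2_cdf_4(x) = target."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_cdf_4(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

eps = 0.05
d0 = solve(1 - eps)     # D0 test: reject when D > d0
d1 = solve(eps)         # D1 test: reject when D < d1
print("D0 =", round(d0, 3), "level", 1 - chi2_cdf_4(d0))
print("D1 =", round(d1, 3), "level", chi2_cdf_4(d1))
```

Both rules reject a true hypothesis with probability $\varepsilon$. The level alone therefore cannot decide between them.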

As an example, we may consider the $\chi^2$ test. In Ch. 30, we have denoted an
observed value of $\chi^2$ as significant, when it exceeds the $p = 100\,\varepsilon$ % value $\chi_p^2$. This
evidently corresponds to the $D_0$ test mentioned above, and the set $S$ is here the
interval $\chi^2 > \chi_p^2$. When the number $n$ of d. of fr. is large, $\sqrt{2\chi^2}$ may be regarded
as normal $(\sqrt{2n - 1}, 1)$, and the same set $S$ is then approximately represented by the
interval $\sqrt{2\chi^2} > \sqrt{2n - 1} + \lambda_{2p}$, or $\chi^2 > \frac{1}{2}(\sqrt{2n - 1} + \lambda_{2p})^2$, where $\lambda_{2p}$ is the $2p$ %
value of a normal deviate, thus making the area of the right tail of the approximating
normal curve equal to $p/100 = \varepsilon$. However, in the latter case it would also seem
reasonable to take account of both tails of the normal curve, thus counting $\chi^2$ as
significant, when $|\sqrt{2\chi^2} - \sqrt{2n - 1}| > \lambda_p$. In this case, the set $S$ would be composed
of the two intervals $\chi^2 < \frac{1}{2}(\sqrt{2n - 1} - \lambda_p)^2$ and $\chi^2 > \frac{1}{2}(\sqrt{2n - 1} + \lambda_p)^2$. In both cases,
the probability of an unjust rejection of the hypothesis tested will be $\varepsilon$.

Further, the deviation measure $D$ is by no means uniquely determined.
We may, e. g., measure the goodness of fit of a hypothetical distribution
to a sample by $\chi^2$, by $\omega^2$, etc. Similarly the deviation of a normal
sample from the hypothesis that the population mean is equal to
$m$ may be measured e. g. by $|\bar{x} - m|$ or by $|z - m|$, where $\bar{x}$ and $z$
are the mean and the median of the sample, etc. For any alternative
deviation measure $\Delta$, we may in infinitely many ways find a set of
points $\Sigma$ such that, if $H$ is true, we have $P(\Delta \subset \Sigma) = \varepsilon$. The test
which consists in rejecting $H$ whenever $\Delta$ takes a value belonging to
$\Sigma$, and otherwise accepting, will still correspond to the given probability
$\varepsilon$ of rejecting $H$ when it is true.

Obviously it will be an important problem to find some rational
method of discriminating between the various possible tests for a given
hypothesis. Will it be possible to assign a reasonable meaning to the
statement that, of two tests corresponding to the same value of $\varepsilon$,
one is »better» or »more efficient» than the other?

During recent years, much work has been devoted to this problem
by J. Neyman, E. S. Pearson and their followers. The reader is referred
to a series of fundamental papers (Ref. 170-173) by Neyman and
Pearson, and to a general exposition of the theory by Neyman (Ref.
168), where numerous references to the literature will be found.

The basic idea of the Neyman-Pearson theory may be briefly described
in the following way. When a test of significance is applied
in practice, there are in each case two possible alternatives: we may
decide to reject or to accept the proposed hypothesis $H$, and then act
according to our decision 1). In either case our decision may be wrong,
since we may reject $H$ in a case when, in fact, it is true, and accept
it in a case when it is false. *) It now seems a perfectly reasonable
principle that, in choosing a test, we should try to reduce the chances
of committing both these kinds of errors as much as possible.

In order that a test of the hypothesis $H$ should be judged to be »good»,
we should accordingly require that the test has a small probability of
rejecting $H$ when this hypothesis is true, but a large probability of
rejecting $H$ when it is false. Of two tests corresponding to the same
probability $\varepsilon$ of rejecting $H$ when it is true, we should thus prefer the
one that gives the largest probability of rejecting $H$ when it is false.


1) There is, of course, also the third alternative that we may decide to remain in
doubt and postpone action until further data have been collected. However, we consider
here the case when such data are already available, and the course of action
must be decided.

*) This double possibility of error distinguishes the present situation from the
one arising in the theory of estimation. When we assert, e. g., that the unknown
value of a certain parameter belongs to such and such confidence interval, our statement
may be right or wrong, but there is only one way of committing an error, viz.
by indicating an interval which, in fact, does not contain the parameter.

We now proceed to show some applications of the general principle.
It will be necessary to restrict ourselves to a very brief account of
some of the most elementary features of this important theory, which
is still in full development.

35.2. Simple and composite hypotheses. Consider $n$ random
variables $x_1, \ldots, x_n$, with a joint distribution in $R_n$ of the continuous
type, defined by a pr. f. $P(S; \alpha) = P(S; \alpha_1, \ldots, \alpha_k)$ of known mathematical
form, containing $k$ unknown parameters $\alpha_1, \ldots, \alpha_k$, or by the
corresponding fr. f. $f(x; \alpha) = f(x_1, \ldots, x_n; \alpha_1, \ldots, \alpha_k)$.

When, in particular, the $x_i$ are independent variables all having
the same distribution, we have the ordinary case of a sample of $n$
values from this distribution. However, as pointed out in the analogous
case considered in 32.8 (cf also 34.3), the above definitions cover also
more general cases, such as e. g. the case when the $x_i$ consist of
several independent samples from possibly unequal distributions. Even
in the general case, we shall refer to the point $x = (x_1, \ldots, x_n)$ and
the space $R_n$ as the sample point and the sample space respectively.
The parameters $\alpha_j$ will be represented by the parametric point
$\alpha = (\alpha_1, \ldots, \alpha_k)$ in the parametric space $P_k$.

Suppose now that a sample point $x$ has been determined by a
single performance of the random experiment corresponding to the
combined variable $(x_1, \ldots, x_n)$. The hypothesis that, in the distribution
of this variable, the unknown parametric point $\alpha$ belongs to a given
set of points $\omega$ in the parametric space $P_k$, will be briefly denoted as
the hypothesis $H$. When $\omega$ consists of one single point $\alpha_0$, we shall
say that the hypothesis is simple, and otherwise that it is composite.
Evidently a simple hypothesis specifies the distribution completely,
while a composite hypothesis leaves it more or less undetermined.

Every parametric point $\alpha$ that is regarded as a priori possible will
be called an admissible point, corresponding to an admissible hypothesis.
The set $\Omega$ of all admissible points may coincide with the whole parametric
space $P_k$, but may also form only a part of $P_k$.

If the $x_i$ are a sample of independent values from a non-singular normal distribution,
where no a priori information concerning the parameters $m$ and $\sigma$ is available,
the set $\Omega$ of admissible hypotheses consists of the half-plane $\sigma > 0$. The hypothesis
that $m = 0$ and $\sigma = 1$ is a simple hypothesis, while the hypothesis that $m = 0$, without
specifying the value of $\sigma$, is a composite hypothesis.



35.3. Tests of simple hypotheses. Most powerful tests. Suppose
that it is required to test the simple hypothesis $H_0$ that the unknown
parametric point $\alpha$ coincides with the given point $\alpha_0$. A test of this
hypothesis will consist of a rule to reject $H_0$ whenever the observed
point $x$ belongs to a certain set $S$ in $R_n$, and otherwise to accept $H_0$.
The set $S$ will be called the critical set of the test, and the test based
on the critical set $S$ will often be briefly called the test $S$.

When the critical set $S$ has been fixed, the probability of rejecting
$H_0$ is identical with the probability $P(S; \alpha)$ that the sample point
belongs to the set $S$. This is a function of the $k$ variables $\alpha_1, \ldots, \alpha_k$,
which will be called the power function of the test. According to the
general desideratum expressed in 35.1, we should endeavour to arrange
the test so as to render the power function small when $H_0$ is true
(i. e. when $\alpha$ coincides with $\alpha_0$), and large when $H_0$ is false (i. e. when
$\alpha$ is any admissible point other than $\alpha_0$).

Since the $x$ distribution is continuous, there are always an infinity
of different sets $S$ such that $P(S; \alpha_0) = \varepsilon$, where $\varepsilon$ is our level of
significance given in advance. If any of these sets is chosen as the
critical set of a test, the probability of rejecting $H_0$ when it is true
will be equal to $\varepsilon$, and we shall then briefly say that we are concerned
with a test of level $\varepsilon$. It is now required to find, from among all
tests of level $\varepsilon$, one that renders the probability of rejecting $H_0$ when
it is false as large as possible, i. e. one that renders the power function
$P(S; \alpha)$ as large as possible for any admissible $\alpha \neq \alpha_0$.

Let $\alpha_1$ be a fixed admissible point $\neq \alpha_0$. Since the values of a
fr. f. may be arbitrarily changed over a set of measure zero, we may
always suppose that $f(x; \alpha_0)$ and $f(x; \alpha_1)$ are finite and determined
for every $x$. For any $c \geq 0$, the set $X$ of all points $x$ such that

(35.3.1)  $f(x; \alpha_1) \geq c f(x; \alpha_0)$

is then well determined. When $c$ increases from 0 to $\infty$, the function
$\psi(c) = P(X; \alpha_0)$ is never increasing. Further, $\psi(0) = 1$, and we easily
find $0 \leq \psi(c) \leq 1/c$, so that $\psi(c) \to 0$ as $c \to \infty$. In order to avoid
trivial complications, we shall assume 1) that there exists a value $c$ such
that $\psi(c) = \varepsilon$. For the corresponding set $X$ we then have

(35.3.2)  $P(X; \alpha_0) = \int_X f(x; \alpha_0)\,dx = \varepsilon.$

Let now $S$ be the critical set of any test of level $\varepsilon$, so that

(35.3.3)  $P(S; \alpha_0) = \int_S f(x; \alpha_0)\,dx = \varepsilon.$

We shall then show that

(35.3.4)  $P(X; \alpha_1) \geq P(S; \alpha_1).$

Thus, among all tests of level $\varepsilon$, the test $X$ gives the largest possible value
to the probability of rejecting $H_0$ when the alternative hypothesis $H_1$ that
$\alpha = \alpha_1$ is true. Accordingly the test $X$ will be called the most powerful
test of $H_0$ with respect to $H_1$, among all tests of level $\varepsilon$.

From (35.3.2) and (35.3.3) we obtain

(35.3.5)  $P(X - SX; \alpha_0) = \varepsilon - P(SX; \alpha_0) = P(S - SX; \alpha_0).$

From the definition (35.3.1) of the set $X$, it follows that for any $x$
not belonging to $X$, we have $c f(x; \alpha_0) > f(x; \alpha_1)$. Hence

(35.3.6)  $P(X - SX; \alpha_1) \geq c P(X - SX; \alpha_0) = c P(S - SX; \alpha_0) \geq P(S - SX; \alpha_1).$

Adding $P(SX; \alpha_1)$ to the last inequality, we obtain (35.3.4).

1) There always exists a value $c$ such that $\psi(c - 0) \geq \varepsilon$, $\psi(c + 0) \leq \varepsilon$. The exceptional
case when $\psi(c)$ does not actually assume the value $\varepsilon$ is included in the argument
by means of a slight modification of the definition of the set $X$. In fact,
$\psi(c - 0) - \psi(c + 0)$ is the integral of $f(x; \alpha_0)$ over the set $Z$ of all points $x$ such that
the sign of equality holds in (35.3.1). By excluding from the set $X$ a conveniently
chosen subset of $Z$, we may always obtain a set satisfying (35.3.2). In all points of
this modified set $X$, (35.3.1) is satisfied, while in all points of the complementary set
we have $f(x; \alpha_1) \leq c f(x; \alpha_0)$. This is obviously sufficient to permit the conclusion
(35.3.6).

It may occur that we obtain the same set $X$ for all admissible
points $\alpha_1 \neq \alpha_0$. In such a case we shall say that, among all tests of
level $\varepsilon$, the test $X$ is the uniformly most powerful test of $H_0$ with
respect to the whole set $\Omega$ of admissible hypotheses. When a uniformly
most powerful test exists, it seems fairly clear that it should
be regarded as superior to any alternative test of the same level $\varepsilon$.
Unfortunately, this situation occurs but very rarely.

Consider the case when the $x_i$ are $n$ sample values from a distribution involving
a single unknown parameter $\alpha$, and suppose that there exists a sufficient estimate
$\alpha^*$ of $\alpha$. By 32.4, the joint fr. f. $f(x_1, \ldots, x_n; \alpha)$ can then be written in the form
$g(\alpha^*; \alpha)\,H(x_1, \ldots, x_n)$, where $H$ is independent of $\alpha$. When $H > 0$, (35.3.1) then
takes the form $g(\alpha^*; \alpha_1) \geq c\,g(\alpha^*; \alpha_0)$. If certain general regularity conditions are
satisfied, the set $X$ will thus be a domain bounded by the hypersurface $g(\alpha^*; \alpha_1) =
c\,g(\alpha^*; \alpha_0)$, and this equation is equivalent to a certain number of equations of the
form $\alpha^* =$ const. If, for different alternative hypotheses $\alpha_1$, we always obtain the
same individuals of the family $\alpha^* =$ const. as bounding hypersurfaces of the set $X$,
it thus follows that a uniformly most powerful test exists. However, it can be shown
by examples (cf Neyman and Pearson, Ref. 173) that this property does not always
hold. Thus even in this simple case we cannot, without imposing further conditions,
assert the existence of a uniformly most powerful test. Cf further Neyman, Ref. 166,
where the question is brought into connection with the problem of the shortest
confidence intervals mentioned in 34.2.

A still simpler case, where the above developments provide a complete solution
of the problem, is the case when only two alternative hypotheses exist. The joint
fr. f. of the $x_i$ may then be written in the form $(1 - \alpha) f_0(x) + \alpha f_1(x)$, where $f_0$ and
$f_1$ are given fr. f:s, and the admissible values of $\alpha$ are 0 and 1. The hypothesis $H_0$
to be tested is the hypothesis that $\alpha = 0$, i. e. the hypothesis that the observed
sample values are drawn from a distribution with the fr. f. $f_0$, the only admissible
alternative being $f_1$. We then have to find the set $X$ of all points $x$ such that
$f_1 \geq c f_0$, where $c$ is determined by the condition $\int_X f_0(x)\,dx = \varepsilon$. The test which
consists in rejecting $H_0$ whenever the observed sample point belongs to the set $X$,
and otherwise accepting, is the most powerful test of level $\varepsilon$. This test may be
applied e. g. to problems of the following type (cf Quensel and Essen-Möller, Ref.
203): Suppose that we have measured certain characters $x_i$ in two human individuals
$A$ and $B$, and that it is required to test the hypothesis that $A$ is the father of $B$.
If we know the distributions of the $x_i$ among the children of persons having the
characters shown by $A$, and among the general population, say with fr. f:s $f_0$ and $f_1$
respectively, the hypothesis implies that the sample values shown by $B$ have been
drawn from a distribution with the fr. f. $f_0$, the alternative being $f_1$. This hypothesis
can be tested as shown above.
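
The case of two alternative fr. f:s lends itself to a small worked sketch. As an illustrative assumption, a single observation $x > 0$ is considered, with $f_0(x) = e^{-x}$ and $f_1(x) = \frac{1}{2} e^{-x/2}$. The ratio $f_1/f_0$ then increases with $x$, so that the set $f_1 \geq c f_0$ is an interval $x \geq k$, and all probabilities are available in closed form. The competing set $S$ of small values is introduced here only for comparison.

```python
import math

eps = 0.05

# Illustrative densities on x > 0: f0(x) = exp(-x), f1(x) = 0.5 * exp(-x / 2).
# f1/f0 = 0.5 * exp(x / 2) is increasing, so {f1 >= c f0} is the set {x >= k}.
k = -math.log(eps)                  # P(x >= k | f0) = exp(-k) = eps
c = 0.5 * math.exp(k / 2)           # the constant c in f1 >= c f0
power_X = math.exp(-k / 2)          # P(x >= k | f1)

# A competing critical set of the same level: S = {x <= k1}.
k1 = -math.log(1 - eps)             # P(x <= k1 | f0) = eps
power_S = 1 - math.exp(-k1 / 2)     # P(x <= k1 | f1)

print("most powerful set X: power", power_X)
print("competing set S:     power", power_S)
```

In this example the power of $X$ equals $\sqrt{\varepsilon}$, while the power of $S$ equals $1 - \sqrt{1 - \varepsilon}$, which is even smaller than the level $\varepsilon$ itself.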

A further example will be given in the following paragraph.


35.4. Unbiased tests. We now restrict ourselves to the case of
a single unknown parameter $\alpha$. Let the admissible values of $\alpha$ form
an interval $A$, and suppose that, for almost all $x = (x_1, \ldots, x_n)$, the
fr. f. $f(x; \alpha)$ has for all inner points $\alpha$ of $A$ a partial derivative
$\frac{\partial f}{\partial \alpha} = f_1(x; \alpha)$ such that $|f_1(x; \alpha)| < F(x)$, where $F(x)$ is integrable
over $R_n$. Then by 7.3 the derivative

(35.4.1)  $\frac{\partial P(S; \alpha)}{\partial \alpha} = \int_S f_1(x; \alpha)\,dx$

exists for every set $S$ in $R_n$ and for every $\alpha$ in $A$.

Suppose that we are concerned with the simple hypothesis $H_0$ that
$\alpha = \alpha_0$, where $\alpha_0$ is an inner point of $A$, and let $S$ denote the critical
set of a test of level $\varepsilon$. The power function $P(S; \alpha)$ is then a function
of $\alpha$, such that $P(S; \alpha_0) = \varepsilon$. If, for some admissible $\alpha_1 \neq \alpha_0$, we
have $P(S; \alpha_1) < \varepsilon$, this means that we are less likely to reject $H_0$ when
the alternative hypothesis $H_1$ that $\alpha = \alpha_1$ is true, than when $H_0$ itself is
true. Obviously this must be regarded as an unfavourable property
of the test, which is then called a biased test.

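When, on the other hand, $P(S; \alpha) \geq \varepsilon$ for all admissible $\alpha$, the
test and the critical set $S$ will be said to be unbiased. Since $P(S; \alpha_0) = \varepsilon$,
and the derivative (35.4.1) exists for all $\alpha$ in $A$, it follows that we have

(35.4.2)  $\left(\frac{\partial P(S; \alpha)}{\partial \alpha}\right)_0 = \int_S f_1(x; \alpha_0)\,dx = 0.$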

In generalization of (35.3.1), we now consider the set $X$ of all
points $x$ such that

(35.4.3)  $f(x; \alpha_1) \geq c f(x; \alpha_0) + c_1 f_1(x; \alpha_0),$

where $\alpha_1 \neq \alpha_0$ is a point of $A$, and where the constants $c \geq 0$ and $c_1$
are determined so as to satisfy the conditions 1)

(35.4.4)  $P(X; \alpha_0) = \int_X f(x; \alpha_0)\,dx = \varepsilon, \qquad \left(\frac{\partial P(X; \alpha)}{\partial \alpha}\right)_0 = \int_X f_1(x; \alpha_0)\,dx = 0.$


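For the critical set $S$ of any unbiased test of level $\varepsilon$, we then have
the relation (35.3.5), and from (35.4.2) and (35.4.4) obtain the analogous
relation

$\left(\frac{\partial P(X - SX; \alpha)}{\partial \alpha}\right)_0 = -\left(\frac{\partial P(SX; \alpha)}{\partial \alpha}\right)_0 = \left(\frac{\partial P(S - SX; \alpha)}{\partial \alpha}\right)_0.$

In a similar way as in 35.3 we then obtain

$P(X; \alpha_1) \geq P(S; \alpha_1).$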

1) By a similar argument as in the case of (35.3.2), we can show that this is
always possible, except in certain exceptional cases, where we have to modify the
definition of the set $X$ in the same way as indicated in the footnote p. 529, i. e. by excluding
from $X$ a certain subset of the set $Z$ of all points $x$ such that the sign of equality
holds in (35.4.3).

It may occur that we obtain the same set $X$ for all admissible
points $\alpha_1 \neq \alpha_0$. In such a case it follows that the test $X$ is unbiased
and gives, among all unbiased tests, the largest possible value to the
probability of rejecting $H_0$ when any alternative hypothesis $\alpha = \alpha_1$ is
true 2). The test $X$ will then be called the most powerful unbiased test
of $H_0$.

Consider the case of a sample of $n$ values $x_1, \ldots, x_n$ from a normal distribution
with a known s. d. $\sigma$, and an unknown mean $m$, and let it be required to test the
hypothesis $H_0$ that $m = m_0$. We shall first try to find the conditions for the existence
of a uniformly most powerful test, corresponding to a given level $\varepsilon$. For any
$m_1 \neq m_0$ the relation (35.3.1) takes the form

(35.4.5)  $\frac{f(x; m_1)}{f(x; m_0)} = e^{M\lambda - \frac{1}{2}M^2} \geq c,$

where $M = \sqrt{n}\,(m_1 - m_0)/\sigma$, $\lambda = \sqrt{n}\,(\bar{x} - m_0)/\sigma$. Suppose first that $m_1 > m_0$. We then
have $M > 0$, and if we take

$c = e^{M\lambda_{2p} - \frac{1}{2}M^2},$

where $p = 100\,\varepsilon$ and $\lambda_{2p}$ is the $2p$ % value of a normal deviate, the inequality (35.4.5)
will be satisfied in the set $X$ of all points $x = (x_1, \ldots, x_n)$ such that $\lambda \geq \lambda_{2p}$, or
$\bar{x} \geq m_0 + \lambda_{2p}\,\sigma/\sqrt{n}$. Evidently this set is independent of $m_1$, and the probability
that $x$ belongs to the set $X$, on the hypothesis $H_0$, is equal to $p/100 = \varepsilon$, so that the
condition (35.3.2) is satisfied. Thus the test based on the critical set $X$, which consists
in rejecting $H_0$ whenever $\bar{x} \geq m_0 + \lambda_{2p}\,\sigma/\sqrt{n}$, is a uniformly most powerful test
of $H_0$ with respect to the set of all alternative hypotheses such that $m_1 > m_0$.

For all $m_1 < m_0$, we obtain in the same way the uniformly most powerful test
based on the critical set $X$ defined by $\bar{x} \leq m_0 - \lambda_{2p}\,\sigma/\sqrt{n}$. However, as soon as the
set of admissible alternatives includes values of $m$ both to the right and to the left
of the point $m_0$, we no longer obtain the same set $X$ for all admissible $m_1$. It
follows that in this case no uniformly most powerful test exists.

Consider the power function of the test based on the critical set
$\bar{x} \geq m_0 + \lambda_{2p}\,\sigma/\sqrt{n}$. The power function is equal to the probability that the
sample point belongs to this set, when the true mean is $m$, which is $1 - \Phi(z)$,
where $z = \lambda_{2p} + \sqrt{n}\,(m_0 - m)/\sigma$. This probability steadily increases with $m$,
and for $m = m_0$ takes the value $\varepsilon$. For $m > m_0$ the power function is thus
$> \varepsilon$, so that we have a probability $> \varepsilon$ of rejecting $H_0$ as soon as the true
mean exceeds $m_0$. When $m < m_0$, on the other hand, the power function is $< \varepsilon$,
which means that the test is biased. The corresponding properties hold, of course,
for the test based on the set $\bar{x} \leq m_0 - \lambda_{2p}\,\sigma/\sqrt{n}$.
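The one-sided power function just described can be evaluated numerically. The following is a minimal sketch, not part of the original text: the function name and the illustration values ($m_0 = 0$, $\sigma = 1$, $n = 16$, $\varepsilon = 0.05$) are assumptions chosen for the example, and the normal distribution function is taken from Python's standard library.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, s. d. 1)


def one_sided_power(m, m0, sigma, n, eps):
    """Power 1 - Phi(z) of the test rejecting H0 when
    xbar >= m0 + lam_2p * sigma / sqrt(n), for a true mean m."""
    lam_2p = STD_NORMAL.inv_cdf(1.0 - eps)        # upper eps point of a normal deviate
    z = lam_2p + (n ** 0.5) * (m0 - m) / sigma     # z as defined in the text
    return 1.0 - STD_NORMAL.cdf(z)


# Hypothetical illustration values, not taken from the text.
for true_mean in (-0.5, 0.0, 0.5):
    print(true_mean, one_sided_power(true_mean, 0.0, 1.0, 16, 0.05))
```

Under these assumptions the printed values should lie below $\varepsilon$ for $m < m_0$, equal $\varepsilon$ at $m = m_0$, and exceed $\varepsilon$ for $m > m_0$, in agreement with the remark on bias.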

We now proceed to consider the best unbiased test, using the same level
$\varepsilon = p/100$ as before. The condition (35.4.3) takes here the form

$$e^{\mu\lambda - \frac{1}{2}\mu^2} \geq c + c_1'\,\lambda,$$

where $c_1' = c_1\sqrt{n}/\sigma$. We may always choose $c$ and $c_1$ such that the sign of
equality holds here when $\lambda = \pm\lambda_p$, and the set $X$ will then consist of all points
$x$ such that $|\lambda| \geq \lambda_p$, or $|\bar{x} - m_0| \geq \lambda_p\,\sigma/\sqrt{n}$. This set evidently satisfies
both conditions (35.4.4). Thus the ordinary test which consists in rejecting $H_0$
whenever the absolute deviation $|\bar{x} - m_0|$ exceeds $\lambda_p\,\sigma/\sqrt{n}$ is the most
powerful unbiased test of $H_0$.

*) This is a slight modification of a proposition due to Neyman and Pearson
(Ref. 172).

The power function of this test is equal to $\Phi(z') + 1 - \Phi(z'')$, where
$z' = -\lambda_p + \sqrt{n}\,(m_0 - m)/\sigma$, $z'' = \lambda_p + \sqrt{n}\,(m_0 - m)/\sigma$, while $m$ is the true
mean. It is easily seen that this function attains its minimum for $m = m_0$, when
it is equal to $\varepsilon$. For $m \neq m_0$, the power function always exceeds $\varepsilon$, and tends
to 1 as $m \to \pm\infty$. According to the above, the graph of this power function lies
entirely above the corresponding graph of any other unbiased test. The power
function of the preceding test based on $\bar{x} \geq m_0 + \lambda_{2p}\,\sigma/\sqrt{n}$ is greater than the
present power function for $m > m_0$, but falls below it for $m < m_0$, and even tends
to zero as $m \to -\infty$.
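The comparison between the two power functions may be checked numerically. The sketch below is an illustration only: the function names and the numerical values used are assumptions, and both formulas are transcribed from the expressions $1 - \Phi(z)$ and $\Phi(z') + 1 - \Phi(z'')$ given in the text.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def one_sided_power(m, m0, sigma, n, eps):
    """Power of the test rejecting when xbar >= m0 + lam_2p * sigma / sqrt(n)."""
    lam_2p = STD_NORMAL.inv_cdf(1.0 - eps)
    return 1.0 - STD_NORMAL.cdf(lam_2p + (n ** 0.5) * (m0 - m) / sigma)


def two_sided_power(m, m0, sigma, n, eps):
    """Power Phi(z') + 1 - Phi(z'') of the test rejecting when
    |xbar - m0| exceeds lam_p * sigma / sqrt(n)."""
    lam_p = STD_NORMAL.inv_cdf(1.0 - eps / 2.0)    # P(|xi| > lam_p) = eps
    shift = (n ** 0.5) * (m0 - m) / sigma
    z1 = -lam_p + shift                             # z' in the text
    z2 = lam_p + shift                              # z'' in the text
    return STD_NORMAL.cdf(z1) + 1.0 - STD_NORMAL.cdf(z2)


# Hypothetical illustration values: m0 = 0, sigma = 1, n = 16, eps = 0.05.
for true_mean in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(true_mean,
          one_sided_power(true_mean, 0.0, 1.0, 16, 0.05),
          two_sided_power(true_mean, 0.0, 1.0, 16, 0.05))
```

With these values the two-sided power should be smallest at $m = m_0$, while the one-sided power should be the larger of the two for $m > m_0$ and the smaller for $m < m_0$.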

In the ordinary tests based on the use of standard errors (cf 31.1), we assume
that the variable $z$ under investigation may, with a practically sufficient
approximation, be regarded as normally distributed with a known s. d. $D(z)$. If, on
the basis of one observed value of $z$, we are testing the hypothesis that the mean
$E(z)$ has some specified value $z_0$, and if the circumstances of the problem permit
us to restrict the set $\Omega$ of admissible alternatives e. g. to the domain
$E(z) \geq z_0$, the above shows that we should certainly use the test which consists in
rejecting the hypothesis (on the $100\,\varepsilon = p$ % level) when $z \geq z_0 + \lambda_{2p}\,D(z)$. This
situation will sometimes occur in practice, e. g. when we are concerned with data
relative to the effect of some method that will be very unlikely to impair, but
may possibly improve, the quality of the thing produced. If, on the other hand,
we are not prepared to introduce a priori a restriction of this »one-sided» type,
we should use the ordinary test based on the absolute deviation $|z - z_0|$.

35.5. Tests of composite hypotheses. - As we proceed from simple
to composite hypotheses, the theory becomes considerably more complicated,
and we shall have to restrict ourselves to some brief remarks,
referring for further information to the original papers quoted in
35.1.

Using the general notations introduced in 35.2, we consider the
hypothesis $H$ that the unknown parametric point $\alpha$ belongs to a given
set $\omega$ which is a subset of the set $\Omega$ of all admissible points. As in
the case of a simple hypothesis, a test of $H$ will consist of a rule to
reject $H$ whenever the observed sample point $x$ belongs to a certain
critical set $S$, and otherwise to accept $H$. According to the general
desideratum of 35.1, we should try to find $S$ so as to render the
power function $P(S; \alpha)$ small when $\alpha$ belongs to $\omega$, and large when
$\alpha$ belongs to the set $\Omega - \omega$ of admissible alternatives.

In some cases it is possible to find a set $S$, and even a family
of sets, such that $P(S; \alpha)$ is constantly equal to any given level $\varepsilon$
for all $\alpha$ in $\omega$. We shall then say that $S$ is similar to the sample
space¹) with respect to the set $\omega$. The test $S$, i. e. the test based on
the critical set $S$, then always gives the probability $\varepsilon$ for committing
the error of rejecting $H$ in a case when it is true, whatever be the
value of $\alpha$ in $\omega$, and we accordingly say that the test is of level $\varepsilon$.

Suppose now that it is possible to find a test $X$ of level $\varepsilon$ such
that, for any $\alpha$ belonging to $\Omega - \omega$ and for any test $S$ of level $\varepsilon$, we
have $P(X; \alpha) \geq P(S; \alpha)$. In analogy with 35.3 we shall then say that,
among all tests of level $\varepsilon$, the test $X$ is the uniformly most powerful
test of $H$ with respect to the set $\Omega - \omega$ of alternative hypotheses.

Similarly, if, for a test $S$ of level $\varepsilon$, we have $P(S; \alpha) \geq \varepsilon$ for all
admissible $\alpha$, the test will be called unbiased. A most powerful unbiased
test is a test $X$ of level $\varepsilon$, such that $P(X; \alpha) \geq P(S; \alpha)$ for
any $\alpha$ belonging to $\Omega - \omega$ and for any unbiased test $S$ of level $\varepsilon$.

The general conditions under which these classes of tests exist,
and the methods by which they may be found, are still very incompletely
known. We shall only give some simple examples without
proofs.

Consider $n$ sample values $x_1, \ldots, x_n$ from a normal distribution with unknown
parameters $m$ and $\sigma$. Let it be required to test the hypothesis that $m = m_0$,
without specifying the value of $\sigma$. In the $(m, \sigma)$-plane, the set $\Omega$ of all
admissible hypotheses is (cf 35.2) the half-plane $\sigma > 0$, while the set $\omega$ consists
of that part of the line $m = m_0$ that belongs to $\sigma > 0$.

Let $T$ denote any set of real numbers such that $\int_T s_{n-1}(t)\,dt = \varepsilon$, where
$s_{n-1}(t)$ is the fr. f. of Student's ratio $t = \sqrt{n-1}\,(\bar{x} - m_0)/s$, and let $S$
denote the set of all points $x$ in $R_n$ such that the corresponding ratio $t$ belongs
to the set $T$. Then for any $\sigma$ we have $P(S; m_0, \sigma) = P(t \subset T) = \varepsilon$, and it
follows that the set $S$ is similar to the sample space with respect to the given
set $\omega$.

If the set $\Omega - \omega$ of admissible alternatives is restricted to cases with
$m > m_0$, and if we choose for $S$ the set of all $x$ such that $t > t_{2p}$, where
$p = 100\,\varepsilon$, it can be shown (cf the papers quoted in 35.1) that the test $S$ is
uniformly most powerful. Similarly, with respect to any alternatives $m < m_0$, the
test based on the set $t < -t_{2p}$ is uniformly most powerful. If the admissible
alternatives include values of $m$ both to the right and to the left of the point
$m_0$, no uniformly most powerful test exists, but the test which consists in
rejecting $H$ whenever $|t| > t_p$ is the most powerful unbiased test of level $\varepsilon$.
All this is analogous to the results proved in 35.4 for the case when $\sigma$ is known.
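That the critical set $|t| > t_p$ is similar to the sample space can be illustrated by simulation: the observed rejection frequency under $H$ should be close to $\varepsilon$ whatever the value of $\sigma$. The sketch below is an assumption-laden illustration, not part of the original text. The sample size $n = 10$, the number of trials and the seed are arbitrary choices, and the value 2.262 is the tabulated two-sided 5 % point of Student's distribution with 9 d. of fr., supplied by hand because the standard library has no Student distribution.

```python
import math
import random


def student_ratio(xs, m0):
    """Student's ratio t = sqrt(n - 1) * (xbar - m0) / s, with s^2 = (1/n) * sum (x - xbar)^2."""
    n = len(xs)
    xbar = sum(xs) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / n)
    return math.sqrt(n - 1) * (xbar - m0) / s


def rejection_rate(sigma, m0=0.0, n=10, t_p=2.262, trials=20000, seed=1):
    """Fraction of simulated samples from normal (m0, sigma) with |t| > t_p."""
    rng = random.Random(seed)          # fixed seed so that the run is reproducible
    rejected = 0
    for _ in range(trials):
        xs = [rng.gauss(m0, sigma) for _ in range(n)]
        if abs(student_ratio(xs, m0)) > t_p:
            rejected += 1
    return rejected / trials
```

Calling `rejection_rate` with, say, $\sigma$ = 0.1, 1 and 10 should give frequencies near 0.05 in each case, up to sampling fluctuation.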

The case of the difference between the means of two normal distributions has
been investigated from the power function standpoint by Welch and Hsu (Ref. 229, 127).
It appears from their work that the test $|u| > t_p$ used in 31.2 and 31.3, Ex. 4, is
only a satisfactory test of the hypothesis $m_1 = m_2$ on the condition that it is
known that $\sigma_1 = \sigma_2$. If the admissible hypotheses include cases with
$\sigma_1 \neq \sigma_2$, the test may be seriously biased.

¹) The introduction of this expression is due to the fact that the set $S = R_n$
satisfies the condition with $\varepsilon = 1$.


CHAPTER 36. 


Analysis of Variance. 

36.1. Variability of mean values. - The analysis of variance is a
statistical technique introduced by R. A. Fisher (Ref. 13, 14) in connection
with certain experimental designs applied in various branches
of biological research work, especially in agriculture. The domain of
applicability of this technique is, however, much wider, and it has
already been successfully applied in many branches of experimental
work.

Suppose that an experiment has furnished the observed values
$x_1, \ldots, x_n$ of certain variables, and that these can be regarded as
independently drawn from normal distributions with a constant, though
unknown s. d. $\sigma$. The means $m_i$ of the distributions, on the other hand,
may vary with certain factors entering into the experiment, such as
different methods of treatment, different varieties of plants or animals,
soil heterogeneity, etc. It is the purpose of the experiment to investigate
this variability of the means, and it may thus be required to
test various hypotheses bearing on these quantities, such as the null
hypothesis (cf 26.4) that differences in treatment or variety have no
influence on the means, etc. It may, of course, also be required to find
estimates of certain means or functions of the means.

On the general null hypothesis that all the $x_i$ have the same mean,
we know that the sum $\sum (x_i - \bar{x})^2$ of squared deviations from the
sample mean, divided by the appropriate number of degrees of freedom
(viz. $n - 1$), provides an unbiased estimate of the unknown variance $\sigma^2$.
The basic idea of the analysis of variance consists in dividing up this
sum of squares into several components, each corresponding to a real
or suspected source of variation in the means. These components are
arranged so as to provide tests for various hypotheses concerning the
behaviour of the means, and estimates of various functions of the
means in which we may be interested.

In the next paragraph, we shall make a detailed study of the
method in a simple particular case, and then proceed to more general
cases.

36.2. Simple grouping of variables. - Consider the simple case
when the observed variables are arranged in $r$ groups, the i:th group
containing $n_i$ variables, all of which are assumed to be normal $(m_i, \sigma)$,
where $\sigma$ is independent of $i$. It is required to investigate the properties
of the $m_i$, and in the first place to test the null hypothesis that all
the $m_i$ are equal, i. e. that there are no differences between the
distributions of the groups. In the particular case $r = 2$, this problem
reduces to the problem of the difference between two mean values
already discussed in 31.2 and 34.4.

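Let $x_{ij}$ denote the j:th variable in the i:th group, while
$\bar{x}_{i.} = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}$ is the arithmetic mean of the variables in the i:th
group, and $\bar{x} = \frac{1}{n}\sum_{i=1}^{r}\sum_{j=1}^{n_i} x_{ij} = \frac{1}{n}\sum_{i=1}^{r} n_i \bar{x}_{i.}$ is the arithmetic mean of all
$n = \sum_{1}^{r} n_i$ variables. We then have the identity

$$\sum (x_{ij} - \bar{x})^2 = \sum (x_{ij} - \bar{x}_{i.})^2 + \sum (\bar{x}_{i.} - \bar{x})^2,$$

where the sum is in each case extended over all $n = \sum_{1}^{r} n_i$ variables.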

Thus the total sum of squared deviations from the general mean $\bar{x}$
is the sum of two components, viz. 1) the sum of squared deviations
of each variable from the corresponding group mean (»sum of squares
within groups»), and 2) the sum of squared deviations of group means
from the general mean (»sum of squares between groups»). This
identity bears an evident resemblance to the identity (21.9.1) used for
the definition of the correlation ratio.

Rewriting the same identity in a more explicit notation, and at
the same time changing the order of the terms in the second member,
we obtain

(36.2.1)    $$\sum_{i=1}^{r}\sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2 = \sum_{i=1}^{r} n_i (\bar{x}_{i.} - \bar{x})^2 + \sum_{i=1}^{r}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2,$$

or briefly

$$Q = Q_1 + Q_2.$$
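The identity (36.2.1) is easily checked on numerical data. The following sketch is an illustration only: the function name and the three small groups of values are hypothetical and do not come from the text.

```python
def sums_of_squares(groups):
    """Return (Q, Q1, Q2) of (36.2.1) for a list of groups of observations."""
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    # Total sum of squared deviations from the general mean.
    q = sum((x - grand_mean) ** 2 for g in groups for x in g)
    # Sum of squares between groups.
    q1 = sum(len(g) * (gm - grand_mean) ** 2 for g, gm in zip(groups, group_means))
    # Sum of squares within groups.
    q2 = sum((x - gm) ** 2 for g, gm in zip(groups, group_means) for x in g)
    return q, q1, q2


# Hypothetical data: r = 3 groups with n_1 = 3, n_2 = 2, n_3 = 4.
q, q1, q2 = sums_of_squares([[1, 2, 3], [2, 4], [5, 6, 7, 8]])
print(q, q1 + q2)
```

The two printed numbers should agree up to rounding error, whatever data are supplied.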

Then $Q$, $Q_1$ and $Q_2$ are quadratic forms in the $x_{ij}$, and we know (cf
11.11 and 29.3) that $Q$ may be orthogonally transformed into the form
$\sum_{1}^{n-1} y_i^2$, and consequently has the rank $n - 1$. Further, $Q_1$ is the sum
of the squares of $r$ linear forms $L_i = \sqrt{n_i}\,(\bar{x}_{i.} - \bar{x})$ satisfying the
identity $\sum_{1}^{r} \sqrt{n_i}\,L_i = 0$, so that by 11.6 the rank of $Q_1$ is $\leq r - 1$.
Similarly $Q_2$ is the sum of the squares of $n$ linear forms $L_{ij} = x_{ij} - \bar{x}_{i.}$
satisfying the $r$ independent relations $\sum_{j=1}^{n_i} L_{ij} = 0$, $(i = 1, \ldots, r)$, so
that the rank of $Q_2$ is $\leq n - r$. Now by 11.6 the rank of $Q$ is at
most equal to the sum of the ranks of $Q_1$ and $Q_2$, and it thus follows
that the latter are exactly $r - 1$ and $n - r$ respectively, so that we
have the following rank relation corresponding to (36.2.1):

$$n - 1 = r - 1 + n - r.$$

Hence we conclude by 11.11 that there exists an orthogonal transformation
replacing the $n$ variables $x_{ij}$ by new variables $y_1, \ldots, y_n$
such that the three terms of (36.2.1) are transformed into the
corresponding terms of the relation

$$\sum_{1}^{n-1} y_i^2 = \sum_{1}^{r-1} y_i^2 + \sum_{r}^{n-1} y_i^2.$$

By hypothesis, the $x_{ij}$ are independent and normally distributed with
a common s. d. $\sigma$, and consequently by 24.4 (cf also Ex. 16, p. 319)
the same holds true for the $y_i$. Thus $Q_1$ and $Q_2$ are independent.

Let us now first assume that the null hypothesis is true, i. e. that
$m_i = m$ for all $i$. Writing $x_{ij} = m + \xi_{ij}$, the $\xi_{ij}$ are independent and
normal $(0, \sigma)$. Introducing this transformation into $Q$, $Q_1$ and $Q_2$, and
denoting by $\bar{\xi}_{i.}$ and $\bar{\xi}$ the arithmetic means corresponding to $\bar{x}_{i.}$ and
$\bar{x}$, the three forms are transformed into the identical expressions with
the letter $x$ throughout replaced by $\xi$. The above orthogonal transformation
replaces the $\xi_{ij}$ by new variables $\eta_1, \ldots, \eta_n$, which are
independent and normal $(0, \sigma)$. $Q$, $Q_1$ and $Q_2$ are hereby transformed
into $\sum_{1}^{n-1} \eta_i^2$, $\sum_{1}^{r-1} \eta_i^2$ and $\sum_{r}^{n-1} \eta_i^2$ respectively. By 18.1 we then find that
$Q/\sigma^2$, $Q_1/\sigma^2$ and $Q_2/\sigma^2$ are distributed in $\chi^2$-distributions with $n - 1$,
$r - 1$ and $n - r$ d. of fr. respectively. Writing

$$s^2 = \frac{Q}{n-1} = \frac{1}{n-1}\sum_{i=1}^{r}\sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2,$$

$$s_1^2 = \frac{Q_1}{r-1} = \frac{1}{r-1}\sum_{i=1}^{r} n_i (\bar{x}_{i.} - \bar{x})^2,$$

$$s_2^2 = \frac{Q_2}{n-r} = \frac{1}{n-r}\sum_{i=1}^{r}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2,$$

we thus have

$$E(s^2) = E(s_1^2) = E(s_2^2) = \sigma^2.$$

The variance ratio $e^{2z} = s_1^2/s_2^2$ may be written

$$e^{2z} = \frac{Q_1/(r-1)}{Q_2/(n-r)} = \frac{\frac{1}{r-1}\sum_{1}^{r-1} \eta_i^2}{\frac{1}{n-r}\sum_{r}^{n-1} \eta_i^2}.$$

Since the $\eta_i$ are independent and normal $(0, \sigma)$, the variable $z$ has the
distribution due to R. A. Fisher, defined by the fr. f. (18.3.5), where
$m$ and $n$ have to be replaced by $r - 1$ and $n - r$ respectively. In
particular the mean and the s. d. of $e^{2z}$ are given by (18.3.4). Tables
of significance limits for $e^{2z}$ and $z$, for various values of the significance
level $\varepsilon = p/100$, are available (Fisher, Ref. 13; Fisher-Yates,
Ref. 262; Snedecor, Ref. 35; Bonnier-Tedin, Ref. 8). The »z test»
introduced by Fisher consists in rejecting the null hypothesis, on the
p % level, whenever $|z| > z_p$, where $z_p$ is determined so as to render
$P(|z| > z_p) = \varepsilon = p/100$.

The null hypothesis is evidently a composite hypothesis (cf 35.2)
concerning the parameters $m_1, \ldots, m_r$ and $\sigma$, viz. the hypothesis that
the $m_i$ are all equal to an unspecified value $m$. Whatever the values
of $m$ and $\sigma$, the probability of rejecting the null hypothesis when it
is true is $P(|z| > z_p) = \varepsilon$. Thus the critical set corresponding to
the z test is similar to the sample space, and the test is of level $\varepsilon$,
according to the definition of 35.5.

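It is customary to arrange the numerical values in a table of the
following type:

Variation         Degrees of   Sum of squares                                                     Mean square
                  freedom
----------------  -----------  -----------------------------------------------------------------  -------------------------
Between groups    $r - 1$      $Q_1 = \sum_{i=1}^{r} n_i (\bar{x}_{i.} - \bar{x})^2$              $s_1^2 = Q_1/(r - 1)$
Within groups     $n - r$      $Q_2 = \sum_{i=1}^{r}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2$   $s_2^2 = Q_2/(n - r)$
Total             $n - 1$      $Q = \sum_{i=1}^{r}\sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2$          $s^2 = Q/(n - 1)$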

Each of the three items under »Mean square» gives, on the null hypothesis,
an unbiased estimate of the population variance $\sigma^2$, and the z
test may be regarded as a test of the compatibility of the independent
estimates given by $s_1^2$ and $s_2^2$.
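The tabular arrangement lends itself to direct computation. The following is a minimal sketch, not a definitive implementation: the function name, the returned dictionary layout and the data are hypothetical, and the significance limit $z_p$ must still be taken from the tables cited above, since it is not computed here.

```python
import math


def anova_table(groups):
    """Entries of the analysis of variance table for a simple grouping.

    Each row is (degrees of freedom, sum of squares, mean square)."""
    r = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    q1 = sum(len(g) * (gm - grand_mean) ** 2 for g, gm in zip(groups, group_means))
    q2 = sum((x - gm) ** 2 for g, gm in zip(groups, group_means) for x in g)
    s1_sq = q1 / (r - 1)            # mean square between groups
    s2_sq = q2 / (n - r)            # mean square within groups
    ratio = s1_sq / s2_sq           # variance ratio e^{2z}
    return {
        "between": (r - 1, q1, s1_sq),
        "within": (n - r, q2, s2_sq),
        "total": (n - 1, q1 + q2, (q1 + q2) / (n - 1)),   # uses Q = Q1 + Q2
        "ratio": ratio,
        "z": 0.5 * math.log(ratio),
    }


# Hypothetical data, for illustration only.
table = anova_table([[1, 2, 3], [2, 4], [5, 6, 7, 8]])
for name in ("between", "within", "total"):
    print(name, table[name])
print("e^{2z} =", table["ratio"], " z =", table["z"])
```

The null hypothesis would then be rejected on the p % level if the computed $|z|$ exceeds the tabulated $z_p$ for $r - 1$ and $n - r$ d. of fr.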

We next proceed to consider the case when the null hypothesis is
not true, i. e. when the group means $m_i$ are not all equal. Writing
$x_{ij} = m_i + \xi_{ij}$, the $\xi_{ij}$ are independent and normal $(0, \sigma)$, and we have

$$(\bar{x}_{i.} - \bar{x})^2 = (\bar{\xi}_{i.} - \bar{\xi})^2 + 2\,(m_i - \bar{m})(\bar{\xi}_{i.} - \bar{\xi}) + (m_i - \bar{m})^2,$$

$$(x_{ij} - \bar{x}_{i.})^2 = (\xi_{ij} - \bar{\xi}_{i.})^2,$$

where $\bar{\xi}_{i.}$ and $\bar{\xi}$ are defined as above, while $\bar{m} = \frac{1}{n}\sum_{1}^{r} n_i m_i$. Introducing
these expressions into $Q_1$ and $Q_2$, we find in the first place that $Q_2$
has the same distribution as in the case when the null hypothesis is
true. We further obtain (cf Irwin, Ref. 133)

$$E(s_1^2) = E\Big(\frac{Q_1}{r-1}\Big) = \sigma^2 + \frac{1}{r-1}\sum_{1}^{r} n_i (m_i - \bar{m})^2,$$

$$E(s_2^2) = E\Big(\frac{Q_2}{n-r}\Big) = \sigma^2,$$

or

$$E\Big[\frac{r-1}{n}\,(s_1^2 - s_2^2)\Big] = \frac{1}{n}\sum_{1}^{r} n_i (m_i - \bar{m})^2.$$

The second member may be regarded as a measure of the variation
among the unknown group means $m_i$. The quantity $(r - 1)(s_1^2 - s_2^2)/n$,
which may be calculated from our data, thus gives an unbiased estimate
of this measure.
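The expression for $E(s_1^2)$ can be verified in a few lines. The step below is a sketch supplied for the reader's convenience, using only the expansion of $(\bar{x}_{i.} - \bar{x})^2$ given above and the fact that the $\xi_{ij}$ have mean zero. Multiplying the expansion by $n_i$, summing over $i$ and taking expectations, we have

$$E(Q_1) = E\Big[\sum_{1}^{r} n_i (\bar{\xi}_{i.} - \bar{\xi})^2\Big] + 2\sum_{1}^{r} n_i (m_i - \bar{m})\,E(\bar{\xi}_{i.} - \bar{\xi}) + \sum_{1}^{r} n_i (m_i - \bar{m})^2.$$

The first term is the expectation of $Q_1$ in the case of equal means, which is $(r - 1)\,\sigma^2$. The second term vanishes, since $E(\bar{\xi}_{i.} - \bar{\xi}) = 0$. Hence

$$E(Q_1) = (r - 1)\,\sigma^2 + \sum_{1}^{r} n_i (m_i - \bar{m})^2,$$

and division by $r - 1$ gives the stated value of $E(s_1^2)$.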


Finally, for any given $i \neq j$ the variable $\bar{x}_{i.} - \bar{x}_{j.}$ is normal
$\big[m_i - m_j,\ \sigma\sqrt{(n_i + n_j)/(n_i n_j)}\,\big]$. Writing

$$\bar{x}_{i.} - \bar{x}_{j.} = (\bar{x}_{i.} - \bar{x}) - (\bar{x}_{j.} - \bar{x}),$$

and observing that the above orthogonal substitution replacing the
$x_{ij}$ by the $y_i$ changes every $\bar{x}_{i.} - \bar{x}$ into a linear combination of
$y_1, \ldots, y_{r-1}$, we further see that $\bar{x}_{i.} - \bar{x}_{j.}$ is independent of $Q_2$. It
follows that the variable

$$\sqrt{\frac{n_i n_j}{n_i + n_j}}\;\frac{\bar{x}_{i.} - \bar{x}_{j.} - (m_i - m_j)}{s_2}$$

has Student's distribution with $n - r$ d. of fr. Working on a p %
level, we thus obtain (cf 34.4) the confidence limits

(36.2.2)    $$\bar{x}_{i.} - \bar{x}_{j.} \pm t_p\,s_2\sqrt{\frac{n_i + n_j}{n_i n_j}}$$

for the difference $m_i - m_j$ between the two unknown group means. In
the particular case when there are only two groups ($r = 2$), these limits
are identical with the confidence limits given by (34.4.5). (Note the
difference in notation with respect to $s_2$!) When $r > 2$ we may, of
course, also apply (34.4.5) to obtain confidence limits for $m_i - m_j$ based
only on the observations belonging to the groups $i$ and $j$. However,
$t_p$ will then only have $n_i + n_j - 2$ d. of fr., so that (36.2.2) with its
$n - r$ d. of fr. will generally yield a smaller value of $t_p$, i. e. a
shorter confidence interval, for the same value of $p$.
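The confidence limits (36.2.2) may be computed as follows. This is a sketch under stated assumptions: the function name and data are hypothetical, groups are indexed from 0 as is usual in Python, and the value $t_p$ for $n - r$ d. of fr. must be supplied by the user from a table of Student's distribution. The value 2.447 used below is the tabulated two-sided 5 % point for 6 d. of fr.

```python
import math


def mean_difference_limits(groups, i, j, t_p):
    """Confidence limits (36.2.2) for m_i - m_j, with s_2^2 = Q2 / (n - r)."""
    r = len(groups)
    n = sum(len(g) for g in groups)
    group_means = [sum(g) / len(g) for g in groups]
    q2 = sum((x - gm) ** 2 for g, gm in zip(groups, group_means) for x in g)
    s2 = math.sqrt(q2 / (n - r))                      # pooled s. d. within groups
    n_i, n_j = len(groups[i]), len(groups[j])
    half_width = t_p * s2 * math.sqrt((n_i + n_j) / (n_i * n_j))
    difference = group_means[i] - group_means[j]
    return difference - half_width, difference + half_width


# Hypothetical data; n - r = 9 - 3 = 6 d. of fr.
lower, upper = mean_difference_limits([[1, 2, 3], [2, 4], [5, 6, 7, 8]], 2, 0, t_p=2.447)
print(lower, upper)
```

The interval is centred on the observed difference of the two group means and uses all $n$ observations in the estimate $s_2$, not only those of the two groups concerned.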

When the null hypothesis is true, the power function (cf 35.3 and
35.5) of the z test assumes the value $\varepsilon$. The behaviour of the power
function when the null hypothesis is not true has been investigated
by Tang (Ref. 224), who has published tables for the numerical calculation
of the function. These tables apply also to the more general
cases considered in the following paragraphs.

The $x_{ij}$ are $n$ random variables, the joint distribution of which involves the
$r + 1$ unknown parameters $m_1, \ldots, m_r$ and $\sigma^2$. The joint fr. f. of the $n$
variables is

$$\frac{1}{(\sigma\sqrt{2\pi})^n}\,\exp\Big[-\frac{1}{2\sigma^2}\sum_{i=1}^{r}\sum_{j=1}^{n_i} (x_{ij} - m_i)^2\Big].$$

The problem of estimating the parameters by means of a sample consisting of one
observed value of each $x_{ij}$ is a case of the generalized estimation problem
considered in 32.8. The relations $E(\bar{x}_{i.}) = m_i$ and $E(s_2^2) = \sigma^2$ show that the
quantities $\bar{x}_{1.}, \ldots, \bar{x}_{r.}$ and $s_2^2$ are unbiased estimates of the parameters. By
means of the relation (32.4.1), duly generalized in the sense of 32.6 to 32.8, we
find that these quantities are joint sufficient estimates. Further, by some
calculation it will be found that the joint efficiency of these estimates is
$\frac{n - r}{n}$.

36.3. Generalization.¹) - The preceding developments may be generalized
to cases when the observed variables are arranged in a more
complicated system of groups and subgroups of various orders. Generally
the variables will then be affected with two or even a greater
number of subscripts, but for our present purpose it will be sufficient
to retain the simple notation $x_1, \ldots, x_n$ of 36.1 for the variables.
As before we suppose that the $x_i$ are independent, and that $x_i$ is normal
$(m_i, \sigma)$, where $\sigma$ does not depend on $i$.

For any grouping system used in a particular problem, we may
then consider sums of squared deviations more or less analogous to
the sums $Q_1$ and $Q_2$ of the preceding paragraph, and it will often be
possible to obtain in this way a relation of the same type as (36.2.1):

(36.3.1)    $$\sum_{1}^{n} (x_i - \bar{x})^2 = Q_1 + Q_2 + \cdots + Q_k,$$

where the $Q_\nu$ are sums of squares of certain linear forms in the $x_i$,
such that we have the corresponding rank relation

$$n - 1 = r_1 + r_2 + \cdots + r_k,$$

$r_\nu$ being the rank of the quadratic form $Q_\nu$. As in the preceding
paragraph it then follows from 11.11 that there exists an orthogonal
transformation changing $Q_1, \ldots, Q_k$ into sums of respectively $r_1, \ldots, r_k$
squares $y_i^2$, such that no two $Q_\nu$ contain a common variable $y_i$. The
$y_i$ being independent, it follows that $Q_1, \ldots, Q_k$ are independent.

Suppose now that it is required to test the hypothesis $H$ that the
unknown means $m_i$ satisfy certain linear equations. It will then often
be possible to arrange the decomposition (36.3.1) of the total sum of
squares in such a way that, if the hypothesis $H$ is true, then two of
the forms $Q_\nu$, say $Q_1$ and $Q_2$, will reduce to zero when all $x_i$ are
replaced by the corresponding $m_i$. Thus in the case considered in the
preceding paragraph, the hypothesis tested is $m_1 = m_2 = \cdots = m_r$, where
the $m_i$ are the group means, and if this hypothesis is true, it is readily
seen that $Q_1$ and $Q_2$ as defined by (36.2.1) both reduce to zero when
the variables are replaced by their mean values.

¹) I have here made use of an unpublished manuscript kindly placed at my
disposal by fil. kand. H. Andersson. For a discussion of the theory from similar, but
more general points of view, cf Kolodziejczyk, Ref. 140 a, and Tang, Ref. 224.

Assuming that $H$ is true, and that the decomposition has been
arranged according to the above, we substitute $x_i = m_i + \xi_i$ into $Q_1$
and $Q_2$. When a non-negative quadratic form $q(x_1, \ldots, x_n)$ is equal
to zero in a point $(m_1, \ldots, m_n)$, it is easily seen that all derivatives
$\frac{\partial q}{\partial x_i}$ must also vanish in the same point. Thus we obtain identically
$Q_\nu(x_1, \ldots, x_n) = Q_\nu(\xi_1, \ldots, \xi_n)$ for $\nu = 1$ and 2. The above orthogonal
transformation will then change $Q_1$ into a sum of squares $\sum_{1}^{r_1} \eta_i^2$,
where the $\eta_i$ are independent and normal $(0, \sigma)$, and similarly for $Q_2$.
Thus on the hypothesis $H$ the variables $Q_1/\sigma^2$ and $Q_2/\sigma^2$ are independent
and distributed in $\chi^2$-distributions of $r_1$ and $r_2$ d. of fr. respectively.
Introducing the mean squares

$$s_1^2 = Q_1/r_1, \qquad s_2^2 = Q_2/r_2,$$

and the variance ratio

$$e^{2z} = s_1^2/s_2^2,$$

it then follows that $E(s_1^2) = E(s_2^2) = \sigma^2$, while $z$ has Fisher's distribution
given in 18.3. Thus the z test may be used to test the hypothesis
$H$.

When the hypothesis H is not true, we may deduce similar results 
as in the preceding paragraph. 

36.4. Randomized blocks. - We now proceed to show the application
of the above theory to some cases of great practical importance.
We shall use a terminology referring to the agricultural applications,
but the same experimental designs may be used in many branches of
research work, in biology and elsewhere.

Consider an agricultural experiment where we want to compare
the effects of $r$ different fertilizing treatments on the crop yield of
some cereal. We then lay out $s$ blocks of equal size on a piece of
land. Each block is divided into $r$ equal plots, among which the $r$
different treatments are randomly distributed. Thus each block contains
one plot of each treatment, and for each particular treatment
we have $s$ different plots.

Let $x_{ij}$ denote the weight of the crop from the plot receiving the
i:th treatment and belonging to the j:th block. We assume that the
$x_{ij}$ are independent and normal $(m_{ij}, \sigma)$, and write

$$\bar{x}_{i.} = \frac{1}{s}\sum_{j=1}^{s} x_{ij}, \qquad \bar{x}_{.j} = \frac{1}{r}\sum_{i=1}^{r} x_{ij}, \qquad \bar{x} = \frac{1}{rs}\sum_{i=1}^{r}\sum_{j=1}^{s} x_{ij}.$$

Thus $\bar{x}_{i.}$ and $\bar{x}_{.j}$ are the sample means for the i:th treatment and the
j:th block respectively, while $\bar{x}$ is the general sample mean. The
identity

(36.4.1)    $$\sum_{i=1}^{r}\sum_{j=1}^{s} (x_{ij} - \bar{x})^2 = s\sum_{i=1}^{r} (\bar{x}_{i.} - \bar{x})^2 + r\sum_{j=1}^{s} (\bar{x}_{.j} - \bar{x})^2 + \sum_{i=1}^{r}\sum_{j=1}^{s} (x_{ij} - \bar{x}_{i.} - \bar{x}_{.j} + \bar{x})^2 = Q_1 + Q_2 + Q_3$$

and the corresponding rank relation

$$rs - 1 = r - 1 + s - 1 + (r - 1)(s - 1)$$

are then easily verified. Hence we infer by the preceding paragraph
that $Q_1$, $Q_2$ and $Q_3$ are independent.

$Q_1$ and $Q_2$ are known as the sums of squares due to variation
»between treatments» and »between blocks» respectively, while for
reasons that will appear below $Q_3$ is usually denoted as the »sum of
squares due to error». The numerical values may be arranged in
tabular form in the same way as shown in 36.2.

The variation in the mean values $m_{ij}$ will be due to soil heterogeneity
and to differences in treatment. Owing to the random arrangement
of the treatments in each block, we may assume that the
effects of soil heterogeneity within each block are included in the
random part of the $x_{ij}$. Any difference between two $m_{ij}$ belonging
to the same block is then due to treatment, and we assume that
$m_{ij} = f_i + b_j$, where $f_i$ only depends on the fertilizing treatment, while
$b_j$ only depends on the block. We shall briefly call $f_i$ the »treatment
effect», and $b_j$ the »block effect». Under these assumptions, it will
be seen that $Q_3$ reduces to zero when the $x_{ij}$ are replaced by their
means, so that $Q_3/\sigma^2$ has a $\chi^2$-distribution with $(r - 1)(s - 1)$ d. of fr.
Consequently the mean square $s_3^2 = Q_3/[(r - 1)(s - 1)]$ gives an unbiased
estimate of $\sigma^2$, which explains the above terminology.

We now want to test the hypothesis $H$ that there are no differences
between the fertilizing treatments. If $H$ is true, we may take
$f_i = 0$ for all $i$, so that $m_{ij} = b_j$ will only depend on the block number
$j$. In this case, both $Q_1$ and $Q_3$ reduce to zero when the $x_{ij}$ are
replaced by their means. Introducing the mean square $s_1^2 = Q_1/(r - 1)$,
we may thus according to the preceding paragraph test $H$ by applying
the z test to the variance ratio $e^{2z} = s_1^2/s_3^2$.

When the hypothesis $H$ is not true, it is shown as in 36.2 that
the quantity $\frac{r-1}{rs}\,(s_1^2 - s_3^2)$ gives an unbiased estimate of the variance
$\frac{1}{r}\sum_{1}^{r} (f_i - \bar{f})^2$ among the unknown treatment effects. Further, for any
given $i \neq j$ we obtain the confidence limits

$$\bar{x}_{i.} - \bar{x}_{j.} \pm t_p\,s_3\sqrt{\frac{2}{s}}$$

for the unknown difference $f_i - f_j$ between the effects of the i:th and
j:th treatments. Here $t_p$ is to be taken with $(r - 1)(s - 1)$ d. of fr.
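The randomized block computations may be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the function names and the yields are hypothetical, `x[i][j]` is taken to be the yield of treatment $i$ in block $j$ (indexed from 0), and $t_p$ must be supplied from a table of Student's distribution with $(r - 1)(s - 1)$ d. of fr.

```python
import math


def randomized_blocks(x):
    """Decomposition (36.4.1) for an r x s array x[i][j] (treatment i, block j)."""
    r, s = len(x), len(x[0])
    grand = sum(sum(row) for row in x) / (r * s)
    treat = [sum(row) / s for row in x]                               # treatment means
    block = [sum(x[i][j] for i in range(r)) / r for j in range(s)]    # block means
    q1 = s * sum((t - grand) ** 2 for t in treat)                     # between treatments
    q2 = r * sum((b - grand) ** 2 for b in block)                     # between blocks
    q3 = sum((x[i][j] - treat[i] - block[j] + grand) ** 2             # error
             for i in range(r) for j in range(s))
    q = sum((v - grand) ** 2 for row in x for v in row)               # total
    s1_sq = q1 / (r - 1)
    s3_sq = q3 / ((r - 1) * (s - 1))
    return {"Q": q, "Q1": q1, "Q2": q2, "Q3": q3,
            "s1_sq": s1_sq, "s3_sq": s3_sq, "treatment_means": treat}


def treatment_difference_limits(x, i, j, t_p):
    """Confidence limits for f_i - f_j, using the error mean square s_3^2."""
    res = randomized_blocks(x)
    s = len(x[0])
    half_width = t_p * math.sqrt(res["s3_sq"]) * math.sqrt(2.0 / s)
    d = res["treatment_means"][i] - res["treatment_means"][j]
    return d - half_width, d + half_width


# Hypothetical yields: r = 3 treatments (rows), s = 3 blocks (columns).
yields = [[10, 12, 11], [13, 15, 16], [9, 10, 12]]
result = randomized_blocks(yields)
print(result["Q"], result["Q1"] + result["Q2"] + result["Q3"])
print("z =", 0.5 * math.log(result["s1_sq"] / result["s3_sq"]))
```

The first printed line should show two equal numbers, up to rounding error, which is the identity (36.4.1).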

In a case where we have had to reject the hypothesis $H$, we may
be interested in testing the further hypothesis $H_1$ that the inequality
between the treatments is wholly due to one particular treatment, say
the one corresponding to $i = 1$, while there are no differences between
the others. If $H_1$ is true, we may take $f_2 = \cdots = f_r = 0$, while $f_1$
is possibly different from zero.

Let $\bar{x}_{(2,r)}$ denote the pooled sample mean for the treatments
$2, \ldots, r$:

$$\bar{x}_{(2,r)} = \frac{1}{(r-1)\,s}\sum_{i=2}^{r}\sum_{j=1}^{s} x_{ij}.$$

The sum of squares »between treatments» appearing in (36.4.1)
may then be further decomposed according to the identity

$$Q_1 = \frac{s\,(r-1)}{r}\,(\bar{x}_{1.} - \bar{x}_{(2,r)})^2 + s\sum_{i=2}^{r} (\bar{x}_{i.} - \bar{x}_{(2,r)})^2 = Q_1' + Q_1'',$$

which gives the rank relation

$$r - 1 = 1 + r - 2.$$

$Q_1'$ and $Q_1''$ may be regarded as the sums of squares »between group
1 and the pooled groups $2, \ldots, r$», and »between groups $2, \ldots, r$»
respectively. Introducing this expression into (36.4.1) we find that, if
the hypothesis $H_1$ is true, both $Q_1''$ and $Q_3$ reduce to zero when the
$x_{ij}$ are replaced by their means. Introducing the mean square
$s_1''^2 = Q_1''/(r - 2)$, we may thus test $H_1$ by applying the z test to the
variance ratio $e^{2z} = s_1''^2/s_3^2$. For the unknown treatment effect $f_1$ we
obtain the confidence limits

$$\bar{x}_{1.} - \bar{x}_{(2,r)} \pm t_p\,s_3\sqrt{\frac{r}{(r-1)\,s}},$$

where as before $t_p$ has $(r - 1)(s - 1)$ d. of fr.

Further hypotheses of a similar kind concerning the properties of
the treatment effects may be tested by analogous methods. The requisite
identities will as a rule be easily found.

36.5. Latin squares. - By the method of randomized blocks, we
try to eliminate the effects of soil heterogeneity, so as to realize an
unbiased comparison between the treatments (or varieties etc., as the
case may be) dealt with in the experiment. An even more complete
elimination is usually obtained by the method of Latin squares.

Consider $r^2$ plots arranged in a square, and let $r$ different fertilizing
treatments be applied to these plots in such a way that each
treatment occurs once in each row, and also once in each column.
Among the numerous possible arrangements satisfying these conditions,
which are known as Latin squares¹), we suppose that one has been
chosen at random for the experiment. Denote by $x_{ij}$ the weight of
the crop from the plot in the i:th row and the j:th column, and let
$\bar{x}_{i.}$ and $\bar{x}_{.j}$ be the row and column means, while $\bar{x}_h$ is the mean for
the plots receiving the h:th treatment, and $\bar{x}$ is the general mean. In
this case we have the identity

$$\sum_{i}\sum_{j} (x_{ij} - \bar{x})^2 = r\sum_{h} (\bar{x}_h - \bar{x})^2 + r\sum_{i} (\bar{x}_{i.} - \bar{x})^2 + r\sum_{j} (\bar{x}_{.j} - \bar{x})^2 + \sum_{i}\sum_{j} (x_{ij} - \bar{x}_h - \bar{x}_{i.} - \bar{x}_{.j} + 2\bar{x})^2 = Q_1 + Q_2 + Q_3 + Q_4,$$

where all sums are extended from 1 to $r$, while in each term of
$Q_4$ the subscript $h$ should correspond to the treatment applied to the
plot $(i, j)$. The rank relation is here

$$r^2 - 1 = r - 1 + r - 1 + r - 1 + (r - 1)(r - 2).$$

¹) Tables of such arrangements are given in Fisher-Yates, Ref. 262.

We now assume that the mean value $E(x_{ij}) = m_{ij}$ consists of one
»treatment effect» $f_h$ and another part due to soil heterogeneity, the
latter being composed of a »row effect» $r_i$ and a »column effect» $c_j$.
We then have $m_{ij} = f_h + r_i + c_j$, and as before we find that $Q_4/\sigma^2$ has a
$\chi^2$-distribution with $(r - 1)(r - 2)$ d. of fr., so that the mean square
$s_4^2 = Q_4/[(r - 1)(r - 2)]$ gives an unbiased estimate of the common
variance $\sigma^2$ of the $x_{ij}$. The tabular arrangement of the data here takes
the following form:


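Variation             Degrees of freedom   Sum of squares   Mean square
--------------------  -------------------  ---------------  ----------------------------------
Between treatments    $r - 1$              $Q_1$            $s_1^2 = Q_1/(r - 1)$
Between rows          $r - 1$              $Q_2$            $s_2^2 = Q_2/(r - 1)$
Between columns       $r - 1$              $Q_3$            $s_3^2 = Q_3/(r - 1)$
Error                 $(r - 1)(r - 2)$     $Q_4$            $s_4^2 = Q_4/[(r - 1)(r - 2)]$
Total                 $r^2 - 1$            $Q$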

The hypothesis that there is no difference between the fertilizing
treatments may be tested by applying the z test to the variance ratio
$e^{2z} = s_1^2/s_4^2$. In a case where this hypothesis has been rejected, we may
estimate the variance among the treatment effects, and the difference
between any two treatment effects, by the same methods as in the
preceding paragraph. Further hypotheses concerning the properties of
the $f_h$ may also be tested in the same way.
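The Latin square decomposition can be checked numerically in the same manner. In the sketch below, which is an illustration only, the function name, the yields and the particular 3 x 3 arrangement are hypothetical: `layout[i][j]` is assumed to hold the index (from 0) of the treatment applied to the plot in row $i$ and column $j$, and the identity holds only when `layout` is a genuine Latin square.

```python
def latin_square_sums(x, layout):
    """Return (Q, Q1, Q2, Q3, Q4) for yields x[i][j] and treatments layout[i][j]."""
    r = len(x)
    grand = sum(sum(row) for row in x) / (r * r)
    row_mean = [sum(x[i]) / r for i in range(r)]
    col_mean = [sum(x[i][j] for i in range(r)) / r for j in range(r)]
    treat_total = [0.0] * r
    for i in range(r):
        for j in range(r):
            treat_total[layout[i][j]] += x[i][j]
    treat_mean = [t / r for t in treat_total]       # each treatment occurs r times
    q1 = r * sum((t - grand) ** 2 for t in treat_mean)    # between treatments
    q2 = r * sum((a - grand) ** 2 for a in row_mean)      # between rows
    q3 = r * sum((c - grand) ** 2 for c in col_mean)      # between columns
    q4 = sum((x[i][j] - treat_mean[layout[i][j]] - row_mean[i] - col_mean[j]
              + 2 * grand) ** 2
             for i in range(r) for j in range(r))          # error
    q = sum((v - grand) ** 2 for row in x for v in row)   # total
    return q, q1, q2, q3, q4


# Hypothetical 3 x 3 experiment.
layout = [[0, 1, 2], [1, 2, 0], [2, 0, 1]]
yields = [[5, 7, 6], [8, 9, 4], [6, 5, 7]]
q, q1, q2, q3, q4 = latin_square_sums(yields, layout)
print(q, q1 + q2 + q3 + q4)
print("e^{2z} =", (q1 / 2) / (q4 / 2))   # (r - 1) = 2 and (r - 1)(r - 2) = 2 for r = 3
```

The two numbers on the first printed line should agree up to rounding error.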

We have here only been concerned with the simplest cases of the
analysis of variance. For further information on the theory of experimental
designs, and for the generalization to the simultaneous
analysis of several variables (»analysis of covariance»), we refer to
books by R. A. Fisher (Ref. 13, 14), Snedecor (Ref. 35) and Bonnier-
Tedin (Ref. 8).




CHAPTER 37. 

Some Regression Problems. 

37.1. Problems involving non-random variables. - In practical
applications, we very often encounter problems where we are concerned
with a random variable $y$, which depends on a certain number of
non-random variables $x_1, \ldots, x_n$. In economic and social statistics, the
values of the $x_i$ will then as a rule simply occur as given non-random
quantities in our statistical data. In experimental work, on the other
hand, the values of the $x_i$ may often be arbitrarily chosen by the
experimenter. In both cases, the $x_i$ will play the rôle of variable parameters
entering into the distribution of $y$, and our statistical data
will consist of a set of observed values of $y$, each corresponding to
known values of the $x_i$. Besides the known parameters $x_i$ the $y$ distribution
may, of course, also contain certain unknown parameters.

Suppose, e. g., that we are investigating the relations between the
quantity $y$ of a commodity A consumed in a given market, and the
prices $x_1, \ldots, x_n$ of A itself and a certain number of other commodities.
It may possibly seem legitimate to regard $y$ as a random variable
with a distribution determined by the prices $x_1, \ldots, x_n$, while the
procedure by which the latter are generated will perhaps not seem
to resemble a random experiment. The $x_i$ will then simply have to
be taken as given quantities appearing in our data.

Suppose, on the other hand, that we are concerned with the influence
upon the output $y$ in a certain factory exerted by the quality
of raw materials and the technical process employed, as characterized
by the variables $x_1, \ldots, x_n$. We may then deliberately choose various
systems of values of the $x_i$, and observe the corresponding values of $y$.
As before, $y$ will here be regarded as a random variable, the distribution
of which contains the $x_i$ as parameters.

The theory of mean square regression developed in Chs 21 and 23
holds, with due modifications, even in the present case. Further, in
the case when the dependent variable $y$ is normally distributed with
a mean value which is a linear function of the variables $x_i$, it has
been shown by Fisher (Ref. 92, 97) and Bartlett (Ref. 54) that certain
regression coefficients have sampling distributions analogous to those
deduced in 29.8 and 29.12, which may form the basis of tests of
significance in a similar way as shown in 31.3, Ex. 6 and 7. Some
of the results due to these authors will be discussed in 37.2 and 37.3.


37.2. Simple regression. - A sample consisting of $n$ observed
pairs of values $(x_1, y_1), \ldots, (x_n, y_n)$ is given. For the sample moments,
we use the ordinary notations $\bar{x}$, $\bar{y}$, $m_{20}$ etc. introduced by (27.1.6) and
(27.1.7). However, we suppose now that $x$ is a non-random variable
while, for every fixed $x$, the random variable $y$ is normally distributed,
with the mean $\alpha + \beta(x - \bar{x})$ and the s. d. $\sigma$, where $\alpha$, $\beta$ and $\sigma$ are
unknown parameters not involving $x$. Thus the sample moments
only involving the $x_i$, such as $m_{20} = s_1^2$ etc., are not to be considered
as random variables, but simply as given constants. On the other
hand, all quantities depending on the $y_i$, such as $\bar{y}$, $m_{02} = s_2^2$, $m_{11} = r s_1 s_2$
etc., are random variables. The $y_i$ are supposed to be independent.

The maximum likelihood estimates (cf 33.2) of a, and a are found 
bj minimizing the joint fr. f. of the y,，which is 

/== (wr e ' 


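It will be found that the estimates of $\alpha$ and $\beta$ are the values of
these parameters that render the sum of squares occurring in the exponent
as small as possible. Hence we obtain the estimates

\[ \alpha^* = \bar{y}, \qquad \beta^* = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{m_{11}}{m_{20}} = \frac{r s_2}{s_1}, \]

while the maximum likelihood estimate of $\sigma^2$ is given by

\[ \sigma^{*2} = \frac{1}{n} \sum_{i=1}^{n} [y_i - \alpha^* - \beta^*(x_i - \bar{x})]^2 = s_2^2 (1 - r^2). \]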
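The estimates above can be sketched numerically. The following is a minimal illustration only, not part of the original text: the function name `simple_regression_ml` and the sample data are hypothetical, and all moments are divided by $n$, as in the notation of 27.1.

```python
import math

def simple_regression_ml(xs, ys):
    """Return (alpha*, beta*, sigma*) for the model of 37.2.

    xs are treated as fixed (non-random) values; ys are the observations.
    """
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    # Sample moments about the means, each divided by n (not n - 1).
    m20 = sum((x - xbar) ** 2 for x in xs) / n                      # s_1^2
    m11 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / n  # r s_1 s_2
    if m20 == 0:
        raise ValueError("all x values coincide; beta* is undetermined")
    alpha_star = ybar
    beta_star = m11 / m20
    # sigma*^2 is the mean squared residual about the fitted line.
    residuals = [y - alpha_star - beta_star * (x - xbar) for x, y in zip(xs, ys)]
    sigma_star = math.sqrt(sum(e * e for e in residuals) / n)
    return alpha_star, beta_star, sigma_star

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
alpha_star, beta_star, sigma_star = simple_regression_ml(xs, ys)
```

Note that $\alpha^*$ estimates the ordinate at $x = \bar{x}$, not at $x = 0$, because the mean is written as $\alpha + \beta(x - \bar{x})$.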


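As linear functions of the $y_i$, the variables $\alpha^*$ and $\beta^*$ are
both normally distributed, and we obtain

\[ E(\alpha^*) = \alpha, \qquad D^2(\alpha^*) = \sigma^2/n, \]
\[ E(\beta^*) = \beta, \qquad D^2(\beta^*) = \sigma^2/(n s_1^2). \]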


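We further have the identity

\[ n\sigma^{*2} = \sum_{i=1}^{n} [y_i - \alpha^* - \beta^*(x_i - \bar{x})]^2
 = \sum_{i=1}^{n} [y_i - \alpha - \beta(x_i - \bar{x})]^2 - n(\alpha^* - \alpha)^2 - n s_1^2 (\beta^* - \beta)^2. \tag{37.2.1} \]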




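The variables $\eta_1, \ldots, \eta_n$, where

\[ \eta_i = y_i - \alpha - \beta(x_i - \bar{x}), \]

are independent and normal $(0, \sigma)$, and the two linear forms

\[ \zeta_1 = \sqrt{n}\,(\alpha^* - \alpha) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \eta_i, \]
\[ \zeta_2 = s_1 \sqrt{n}\,(\beta^* - \beta) = \frac{1}{s_1 \sqrt{n}} \sum_{i=1}^{n} (x_i - \bar{x})\,\eta_i, \]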


obviously satisfy the orthogonality conditions (11.9.1). Writing the
identity (37.2.1) in the form

\[ n\sigma^{*2} = \sum_{i=1}^{n} \eta_i^2 - \zeta_1^2 - \zeta_2^2, \]

we may thus apply Fisher's lemma (cf 29.2), and find that $\alpha^*$, $\beta^*$
and $\sigma^*$ are independent, and that $n\sigma^{*2}/\sigma^2$ is distributed
like $\chi^2$ with $n - 2$ d. of fr. Consequently by 18.2 the variables

\[ \sqrt{n-2}\;\frac{\alpha^* - \alpha}{\sigma^*} \qquad \text{and} \qquad s_1 \sqrt{n-2}\;\frac{\beta^* - \beta}{\sigma^*} \]

have Student's distribution with $n - 2$ d. of fr. With respect to the
regression coefficient $\beta$, this result is formally identical with the
result already obtained (cf 29.8.4) for the case when both variables are
random, and the joint distribution is normal.

Since $s_1$, $\alpha^*$, $\beta^*$ and $\sigma^*$ are all known, we may use this
result to test any hypothetical values of $\alpha$ and $\beta$, and to deduce
confidence limits for these parameters, in the same way as shown for the mean
of an ordinary normal distribution in 31.2 and 34.4. In particular
we find that the regression coefficient $\beta^*$ differs significantly from
zero on the $p$ % level of significance, if

\[ |\beta^*| > \frac{t_p\,\sigma^*}{s_1 \sqrt{n-2}}, \]

where $t_p$ is taken with $n - 2$ d. of fr.

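We may finally be interested in estimating the unknown ordinate

\[ Y = \alpha + \beta(X - \bar{x}) \]

of the regression line in any given point $X$. It will be found that
the variable

\[ \sqrt{n-2}\;\frac{\alpha^* + \beta^*(X - \bar{x}) - Y}{\sigma^* \sqrt{1 + \big(\frac{X - \bar{x}}{s_1}\big)^2}} \]

has Student's distribution with $n - 2$ d. of fr., so that the $p$ %
confidence limits for $Y$ are

\[ \alpha^* + \beta^*(X - \bar{x}) \pm t_p\,\frac{\sigma^*}{\sqrt{n-2}} \sqrt{1 + \Big(\frac{X - \bar{x}}{s_1}\Big)^2}. \tag{37.2.2} \]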
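As a rough sketch of how the Student variable for the slope and the confidence limits (37.2.2) might be computed, consider the following. The function names are hypothetical, the value $t_p$ is assumed to be looked up by the reader in a table of the $t$-distribution with $n - 2$ d. of fr., and the code assumes that the $x_i$ are not all equal and that $\sigma^* > 0$.

```python
import math

def _fit(xs, ys):
    """Quantities of 37.2: n, xbar, s_1, alpha*, beta*, sigma*."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    s1 = math.sqrt(sum((x - xbar) ** 2 for x in xs) / n)
    beta = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n * s1 ** 2)
    sigma = math.sqrt(
        sum((y - ybar - beta * (x - xbar)) ** 2 for x, y in zip(xs, ys)) / n
    )
    return n, xbar, s1, ybar, beta, sigma

def slope_t(xs, ys, beta0=0.0):
    """Student variable s_1 sqrt(n-2) (beta* - beta0) / sigma*, n - 2 d. of fr."""
    n, _, s1, _, beta, sigma = _fit(xs, ys)
    return s1 * math.sqrt(n - 2) * (beta - beta0) / sigma

def ordinate_limits(xs, ys, X, t_p):
    """Confidence limits (37.2.2) for the ordinate at X; t_p is supplied by the caller."""
    n, xbar, s1, alpha, beta, sigma = _fit(xs, ys)
    centre = alpha + beta * (X - xbar)
    half = t_p * sigma / math.sqrt(n - 2) * math.sqrt(1 + ((X - xbar) / s1) ** 2)
    return centre - half, centre + half

# Hypothetical data with n = 5, so n - 2 = 3 d. of fr.; the 5 % value of t is 3.182.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
t = slope_t(xs, ys)
limits_at_mean = ordinate_limits(xs, ys, 3.0, 3.182)
limits_at_edge = ordinate_limits(xs, ys, 5.0, 3.182)
```

The interval is narrowest at $X = \bar{x}$ and widens as $X$ moves away from the mean of the observed $x_i$, as the factor under the square root in (37.2.2) shows.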


37.3. Multiple regression. We now proceed to the case of a
random variable $y$, the mean of which is a linear function of $k$
non-random variables $x_1, \ldots, x_k$. Suppose that a sample of $n$
independently observed points $(y_\nu, x_{1\nu}, \ldots, x_{k\nu})$ is given,
where $\nu = 1, 2, \ldots, n$. For the sample moments, we use the notations
introduced in 27.1 and 29.9, writing e. g. in accordance with (29.9.2)

\[ l_{ij} = \frac{1}{n} \sum_{\nu=1}^{n} (x_{i\nu} - \bar{x}_i)(x_{j\nu} - \bar{x}_j), \qquad (i, j = 1, \ldots, k), \]

and further, regarding $y$ as a variable $x_0$,

\[ l_{0j} = \frac{1}{n} \sum_{\nu=1}^{n} (y_\nu - \bar{y})(x_{j\nu} - \bar{x}_j). \]

By $L$ and $L_{ij}$ we denote the determinant

\[ L = \begin{vmatrix} l_{11} & \cdots & l_{1k} \\ \vdots & & \vdots \\ l_{k1} & \cdots & l_{kk} \end{vmatrix} \]

and its cofactors. We shall assume that $L \neq 0$.

Suppose now that, for any fixed values of $x_1, \ldots, x_k$, the random
variable $y$ is normally distributed, with the mean

\[ E(y) = \alpha + \beta_1 (x_1 - \bar{x}_1) + \cdots + \beta_k (x_k - \bar{x}_k), \tag{37.3.1} \]

and the s. d. $\sigma$. The maximum likelihood estimates $\alpha^*$ and
$\beta_i^*$ $(i = 1, \ldots, k)$ are found to be the values of $\alpha$ and the
$\beta_i$ that render the sum

\[ \sum_{\nu=1}^{n} [y_\nu - \alpha - \beta_1 (x_{1\nu} - \bar{x}_1) - \cdots - \beta_k (x_{k\nu} - \bar{x}_k)]^2 \]

as small as possible. Hence we obtain the estimates

\[ \alpha^* = \bar{y}, \qquad \beta_i^* = \frac{1}{L} \sum_{j=1}^{k} L_{ij}\, l_{0j}, \qquad (i = 1, \ldots, k), \tag{37.3.2} \]

while the maximum likelihood estimate of $\sigma^2$ is given by

\[ \sigma^{*2} = \frac{1}{n} \sum_{\nu=1}^{n} [y_\nu - \alpha^* - \beta_1^* (x_{1\nu} - \bar{x}_1) - \cdots - \beta_k^* (x_{k\nu} - \bar{x}_k)]^2 = s^2_{0.12\ldots k}, \tag{37.3.3} \]

where $s^2_{0.12\ldots k}$ is the sample value of the residual variance (cf 23.3
and 29.12) of $y$ with respect to $x_1, \ldots, x_k$. We shall suppose that
this is positive, which means that the observed values of $y$ cannot be
exactly represented by a linear function of the $x_i$.
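The estimates (37.3.2) are the solution of the linear system $\sum_j l_{ij}\beta_j = l_{0i}$, so that in practice one may solve this system directly instead of forming cofactors. The sketch below does so by Gaussian elimination; it is an illustration under that assumption, and the names `solve` and `multiple_regression_ml` as well as the data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]   # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[p][c]) < 1e-12:
            raise ValueError("determinant L is (numerically) zero")
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def multiple_regression_ml(rows, ys):
    """rows[v] = (x_1v, ..., x_kv). Returns (alpha*, [beta_i*], sigma*^2)."""
    n, k = len(ys), len(rows[0])
    xbar = [sum(r[i] for r in rows) / n for i in range(k)]
    ybar = sum(ys) / n
    # Moments l_ij and l_0j, each divided by n.
    l = [[sum((r[i] - xbar[i]) * (r[j] - xbar[j]) for r in rows) / n
          for j in range(k)] for i in range(k)]
    l0 = [sum((y - ybar) * (r[j] - xbar[j]) for r, y in zip(rows, ys)) / n
          for j in range(k)]
    beta = solve(l, l0)
    resid = [y - ybar - sum(beta[i] * (r[i] - xbar[i]) for i in range(k))
             for r, y in zip(rows, ys)]
    sigma2 = sum(e * e for e in resid) / n
    return ybar, beta, sigma2

# Hypothetical data generated exactly from y = 1 + 2 x_1 - 3 x_2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1, 3, -2, 0, 2]
alpha_star, beta_star, sigma2_star = multiple_regression_ml(rows, ys)
```

In this artificial example the residual variance is zero, which is precisely the case excluded in the text; with real data $\sigma^{*2}$ would be positive.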

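As linear functions of the $y_\nu$, the variables $\alpha^*$ and $\beta_i^*$
are normally distributed. By some calculation, we find the following mean
values and second order central moments:

\[ E(\alpha^*) = \alpha, \qquad E(\beta_i^*) = \beta_i, \]
\[ E(\alpha^* - \alpha)^2 = \frac{\sigma^2}{n}, \qquad E[(\alpha^* - \alpha)(\beta_i^* - \beta_i)] = 0, \qquad E[(\beta_i^* - \beta_i)(\beta_j^* - \beta_j)] = \frac{\sigma^2 L_{ij}}{n L}, \tag{37.3.4} \]

where $i, j = 1, \ldots, k$. Hence we obtain in particular by (23.3.4)

\[ D^2(\beta_1^*) = \frac{\sigma^2 L_{11}}{n L} = \frac{\sigma^2}{n\, s^2_{1.23\ldots k}}, \]

and analogous expressions for $D^2(\beta_i^*)$, $i = 2, \ldots, k$.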

Further, it can be shown that the variable $\sigma^*$ is independent of
the variables $\alpha^*$ and $\beta_i^*$, and that $n\sigma^{*2}/\sigma^2$ has a
$\chi^2$-distribution with $n - k - 1$ d. of fr. In the particular case when
the matrix

\[ \mathbf{L} = \begin{pmatrix} l_{11} & \cdots & l_{1k} \\ \vdots & & \vdots \\ l_{k1} & \cdots & l_{kk} \end{pmatrix} \]

is a diagonal matrix, this can be proved by straightforward generalization
of the method used in the preceding paragraph. In fact, the expressions
$\sqrt{n}\,(\alpha^* - \alpha)$ and $s_i \sqrt{n}\,(\beta_i^* - \beta_i)$ are
linear forms in the variables
$\eta_\nu = y_\nu - \alpha - \beta_1 (x_{1\nu} - \bar{x}_1) - \cdots - \beta_k (x_{k\nu} - \bar{x}_k)$,
and when $\mathbf{L}$ is a diagonal matrix, these forms satisfy the
orthogonality conditions for $i = 1, \ldots, k$, so that Fisher's lemma can
be applied in the same way as before. In the general case, we must first
replace the variables $x_i$ by new variables $x_i'$ by means of an orthogonal
transformation such that the moment matrix of the new variables is a
diagonal matrix (cf 22.6). Applying the contragredient transformation
(cf 11.7) to the $\beta_i$, the proof is then completed as in the particular
case.

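It now follows that the variables

\[ \sqrt{n-k-1}\;\frac{\alpha^* - \alpha}{\sigma^*}, \qquad
   s_{1.23\ldots k} \sqrt{n-k-1}\;\frac{\beta_1^* - \beta_1}{\sigma^*} = \sqrt{n-k-1}\,\sqrt{\frac{L}{L_{11}}}\;\frac{\beta_1^* - \beta_1}{\sigma^*}, \]

and the analogous variables with $\beta_2, \ldots, \beta_k$, all have Student's
distribution with $n - k - 1$ d. of fr. With respect to the $\beta_i$, this
result directly corresponds to (29.12.1). As before, we may now deduce
tests of significance and confidence limits for the unknown parameters
$\alpha$ and $\beta_i$.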

We can also obtain a joint test for a set of hypothetical values
of the regression coefficients $\beta_1, \ldots, \beta_k$. From the expressions
(37.3.4) of the moments of the normally distributed variables $\beta_i^*$, it
follows (cf Ex. 15, p. 319) that the variable

\[ \frac{n}{\sigma^2} \sum_{i,j=1}^{k} l_{ij} (\beta_i^* - \beta_i)(\beta_j^* - \beta_j) \]

has a $\chi^2$-distribution with $k$ d. of fr. Consequently the variable

\[ e^{2z} = \frac{n-k-1}{k\,\sigma^{*2}} \sum_{i,j=1}^{k} l_{ij} (\beta_i^* - \beta_i)(\beta_j^* - \beta_j) \]

is distributed like a variance ratio (cf 18.3 and 36.2) with $k$ d. of fr.
in the numerator, and $n - k - 1$ d. of fr. in the denominator. When
a set of hypothetical values $\beta_1, \ldots, \beta_k$ are given, the quantity
$e^{2z}$ can be calculated from our data, and we may thus use the tables of
the $z$-distribution to test the proposed values of the $\beta_i$.

Suppose, in particular, that it is required to test the hypothesis
that all regression coefficients are zero: $\beta_1 = \cdots = \beta_k = 0$.
From (37.3.2) and (37.3.3) we obtain after some calculation, using (23.5.2)
and (23.5.3),

\[ \sum_{i,j=1}^{k} l_{ij}\, \beta_i^* \beta_j^* = l_{00}\, r^2_{0(12\ldots k)}, \]
\[ \sigma^{*2} = s^2_{0.12\ldots k} = l_{00} \big(1 - r^2_{0(12\ldots k)}\big), \]
\[ e^{2z} = \frac{n-k-1}{k} \cdot \frac{r^2_{0(12\ldots k)}}{1 - r^2_{0(12\ldots k)}}, \]

where $r_{0(12\ldots k)}$ is the multiple correlation coefficient between the
sample values of $y$ and $(x_1, \ldots, x_k)$. It then follows from 18.3
and 18.4 that $r^2_{0(12\ldots k)}$ has a Beta-distribution with the fr. f.
$\beta\big(x;\ \tfrac{k}{2},\ \tfrac{n-k-1}{2}\big)$. In this particular case,
the above test is thus formally identical with the test based on the
distribution (29.13.8) of the hypothesis that a multiple correlation
coefficient differs significantly from zero.
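For the hypothesis that all regression coefficients vanish, the variance ratio depends on the data only through the squared multiple correlation coefficient. A small sketch of that computation follows; the function names are hypothetical, and the resulting value would still have to be compared with a table of the variance ratio (today usually denoted $F$) with $k$ and $n - k - 1$ d. of fr.

```python
def squared_multiple_correlation(l00, sigma2_star):
    """r^2 from sigma*^2 = l_00 (1 - r^2); l00 is the sample variance of y (divisor n)."""
    if l00 <= 0:
        raise ValueError("l00 must be positive")
    return 1.0 - sigma2_star / l00

def variance_ratio_all_zero(r2, n, k):
    """e^{2z} = (n - k - 1)/k * r^2/(1 - r^2) for testing beta_1 = ... = beta_k = 0."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must satisfy 0 <= r2 < 1")
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 observations")
    return (n - k - 1) / k * r2 / (1.0 - r2)

# Hypothetical values: n = 13 observations, k = 2 regressors, r^2 = 0.5.
ratio = variance_ratio_all_zero(0.5, 13, 2)
```

A large value of the ratio speaks against the hypothesis; the ratio grows without bound as $r^2$ approaches 1.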

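For the ordinate $\alpha + \sum_{i=1}^{k} \beta_i (X_i - \bar{x}_i)$ of the
regression line in any given point $(X_1, \ldots, X_k)$, we obtain in direct
generalization of (37.2.2) the $p$ % confidence limits

\[ \alpha^* + \sum_{i=1}^{k} \beta_i^* (X_i - \bar{x}_i) \pm t_p\, \frac{\sigma^*}{\sqrt{n-k-1}} \sqrt{1 + \sum_{i,j=1}^{k} \frac{L_{ij}}{L} (X_i - \bar{x}_i)(X_j - \bar{x}_j)}, \tag{37.3.5} \]

where $t_p$ is to be taken with $n - k - 1$ d. of fr. We finally add
three remarks which are important in many applications.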

1. Let us drop the assumption that $y$ is normally distributed, and only suppose
that, for any fixed values of the $x_i$, the mean value of $y$ is given by the linear
expression (37.3.1), while the s. d. is always equal to $\sigma$. Under these more
general assumptions it can be shown that the estimates (37.3.2) of the parameters
$\alpha$ and $\beta_i$ are the best (i. e. those having the smallest variances) among
all unbiased estimates that are linear functions of the observed $y_\nu$. The
variances and covariances of these estimates are still given by (37.3.4), while
$n\sigma^{*2}/(n - k - 1)$ gives an unbiased estimate of $\sigma^2$. Further, the best
linear unbiased estimate of the ordinate of the regression line in any given point
$(X_1, \ldots, X_k)$ is $\alpha^* + \sum_{i=1}^{k} \beta_i^* (X_i - \bar{x}_i)$, and the
standard error of this estimate is equal to the coefficient of $t_p$ in (37.3.5).
This is equivalent to a classical theorem on least squares due to Markoff and
others. For a proof, we refer e. g. to Neyman and David (Ref. 169).

2. The variables $x_i$ considered in the present paragraph may be any variables,
dependent or independent, subject to the sole condition that $L \neq 0$, which implies
that the $n$ points $(x_{1\nu}, \ldots, x_{k\nu})$ do not all lie on the same hyperplane
in the $k$-dimensional space of the $x_i$. In particular all the $x_i$ may be functions
of a single independent variable $x$. Suppose e. g. that $x_i$ is a polynomial
$p_i(x)$ of degree $i$ in $x$. The above problem is then a problem in parabolic
regression (cf 21.6). If the $p_i(x)$ satisfy the orthogonality conditions, which in
this case take the form indicated in 12.6, Ex. 3, the matrix $\mathbf{L}$ considered
above reduces to a diagonal matrix, and all calculations are considerably
simplified, as we have seen in the analogous case considered in 21.6.



3. When the condition $L \neq 0$ is not satisfied, the variances and covariances
of the $\beta_i^*$ become infinite or undetermined, as shown by (37.3.4). When $L$ is
very small without being actually equal to zero, the points
$(x_{1\nu}, \ldots, x_{k\nu})$ lie »almost» on the same hyperplane. In this case, very
large coefficients, or coefficients which are the ratios between very small
numbers, will appear in our formulae for confidence intervals etc. Small errors in
the data or in the calculations, small deviations from normality etc. will then
have a great influence, and particular caution must be recommended. This
phenomenon will easily present itself when the $x_i$ are strongly connected, as is
often the case e. g. in economic data. The methods to be used for regression
analysis with data of this kind have been much discussed, especially in
connection with problems of the type considered in the next paragraph. We refer
e. g. to the comprehensive work of Schultz (Ref. 34), and to papers by Frisch (Ref.
113, 114) and Wold (Ref. 247, 248).
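The effect described in remark 3 can be made concrete for $k = 2$, where $L = l_{11} l_{22} - l_{12}^2$ and the variance of $\beta_1^*$ is $\sigma^2 L_{11}/(nL)$ with $L_{11} = l_{22}$. The following sketch, with a hypothetical function name and standardized moments chosen purely for illustration, shows how the factor $L_{11}/L$ grows as the two variables become more strongly connected.

```python
def variance_factors(l11, l22, l12):
    """For k = 2 return (L, L_11/L, L_22/L); D^2(beta_i*) = sigma^2/n times the factor."""
    L = l11 * l22 - l12 * l12
    if L == 0:
        raise ZeroDivisionError("L = 0: the variances are infinite or undetermined")
    return L, l22 / L, l11 / L

# With l11 = l22 = 1 the moment l12 is the correlation between x_1 and x_2,
# and the factor equals 1 / (1 - r^2).
for r in (0.0, 0.9, 0.99, 0.999):
    L, factor1, factor2 = variance_factors(1.0, 1.0, r)
    print(f"r = {r}: L = {L:.6f}, variance factor = {factor1:.2f}")
```

The loop is only a demonstration; in practice the moments $l_{ij}$ would be computed from the observed $x_{i\nu}$.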

37.4. Further regression problems. In certain applications of
the theory of regression, e. g. in psychology and economics, we are
concerned with a set of random variables $x_1, \ldots, x_m$, which may be
represented in the form

\[ \begin{aligned}
x_1 &= a_{11} u_1 + \cdots + a_{1n} u_n + v_1, \\
&\;\;\vdots \\
x_m &= a_{m1} u_1 + \cdots + a_{mn} u_n + v_m,
\end{aligned} \tag{37.4.1} \]

where $m > n$, while $u_1, \ldots, u_n$, $v_1, \ldots, v_m$ are $m + n$ uncorrelated
random variables, and $\mathbf{A} = \mathbf{A}_{mn} = \{a_{ik}\}$ is a matrix of
rank $n$.

In the psychological factor analysis of human ability, the variables
$x_1, \ldots, x_m$ represent the measurements of $m$ given different abilities
of a person, while $u_1, \ldots, u_n$ are more or less »general» factors of
intelligence, and $v_1, \ldots, v_m$ are »specific» factors, each associated with
a particular ability. In these cases, the main problems are usually
concerned with the possibility of representing a given set of variables
$x_i$ in the form (37.4.1), and with the existence and number of the
»general» factors.

In some economic problems, on the other hand, there are theoretical
reasons to expect the variables concerned to satisfy certain linear
(or approximately linear) relations. Often, however, these variables
cannot be directly observed, owing to the appearance of »errors» or
»disturbances». Instead of the »systematic parts» of the above variables
$x_i$:

\[ x_i' = a_{i1} u_1 + \cdots + a_{in} u_n, \]

between which there exist $m - n$ linear relations, we can then only
observe the variables $x_i$ themselves as given by (37.4.1) where, now,
the $v_i$ represent the »disturbances». Here, the main problems are
connected with the estimation of the coefficients in the linear relations
between the systematic parts of the $x_i$.

Problems of this kind are too intimately connected with particular
fields of application to be fully discussed here. The psychological
applications belonging to this order of ideas were first treated by
Spearman. We refer in this connection e. g. to the surveys of the
theory given by Spearman (Ref. 35 a) and by Thomson (Ref. 37 a).
The economic problems indicated above have given rise to the
introduction of the confluence analysis of Frisch (Ref. 114), which has been
further developed and brought into contact with sampling and estimation
theory by Koopmans (Ref. 142), Reiersöl (Ref. 207) and others.

We shall here only deduce a simple property of the moment matrix
$\mathbf{\Lambda}$ of a set of variables $x_1, \ldots, x_m$ that may be represented
in the form (37.4.1). Without restricting the generality, we may suppose
that all variables $x_i$, $u_j$ and $v_i$ are measured from their means, and
that $E(u_j^2) = 1$ for all $j$. We further write $E(v_i^2) = \sigma_i^2$, and denote
by $\mathbf{\Sigma}$ the diagonal matrix formed with $\sigma_1, \ldots, \sigma_m$ as its
diagonal elements. We then have by 22.6

\[ \mathbf{\Lambda} = \mathbf{A}\mathbf{A}' + \mathbf{\Sigma}^2. \]

If all the $\sigma_i$ are positive, the moment matrix $\mathbf{\Lambda}$ is of rank
$m$, so that the distribution of the variables $x_1, \ldots, x_m$ is non-singular
(cf 22.5). On the other hand, the matrix $\mathbf{A}\mathbf{A}'$ is ¹) only of
rank $n$. It follows that any minor of order $\geq n + 1$ of the moment matrix
$\mathbf{\Lambda}$ which does not contain any element of the main diagonal ²), is
equal to zero. It is immediately seen that the same property holds for the
correlation matrix $\mathbf{P} = \{\rho_{ik}\}$ of the variables $x_1, \ldots, x_m$.
This theorem is due to Thurstone.

Consider e. g. the particular case $n = 1$. (In the psychological
applications, this is the case when there is only one »general» factor.)
The correlation coefficients $\rho$ corresponding to any four different
subscripts $h, i, j, k$ then satisfy the tetrad relation

\[ \begin{vmatrix} \rho_{hj} & \rho_{hk} \\ \rho_{ij} & \rho_{ik} \end{vmatrix} = \rho_{hj}\rho_{ik} - \rho_{hk}\rho_{ij} = 0. \]

¹) By 11.6, the rank is at most $n$, and it is easily seen that there is at least one
non-zero minor of order $n$.

²) It will be noted that minors satisfying these conditions only exist when
$2n + 2 \leq m$.
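The tetrad relation can be checked numerically for a one-factor model $x_i = a_i u + v_i$ with $E(u^2) = 1$ and $E(v_i^2) = \sigma_i^2$, for which the moment matrix has the elements $a_i a_j$ off the diagonal and $a_i^2 + \sigma_i^2$ on it. The sketch below is an illustration only; the function names and the loadings are hypothetical.

```python
import math
from itertools import permutations

def one_factor_correlations(a, sigma):
    """Correlation matrix of x_i = a_i u + v_i (case n = 1 of 37.4.1)."""
    m = len(a)
    var = [a[i] ** 2 + sigma[i] ** 2 for i in range(m)]   # diagonal of A A' + Sigma^2
    return [[1.0 if i == j else a[i] * a[j] / math.sqrt(var[i] * var[j])
             for j in range(m)] for i in range(m)]

def max_tetrad_difference(rho):
    """Largest |rho_hj rho_ik - rho_hk rho_ij| over four different subscripts.

    Requires at least four variables, in line with the condition 2n + 2 <= m.
    """
    m = len(rho)
    return max(abs(rho[h][j] * rho[i][k] - rho[h][k] * rho[i][j])
               for h, i, j, k in permutations(range(m), 4))

# Hypothetical loadings and specific standard deviations for m = 5 variables.
rho = one_factor_correlations([0.9, 0.8, 0.7, 0.6, 0.5], [0.5, 0.6, 0.7, 0.8, 0.9])
largest = max_tetrad_difference(rho)   # zero up to rounding error
```

With sample correlations in place of the theoretical ones, the tetrad differences would only be approximately zero, and their sampling variation would have to be taken into account.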





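TABLE 1.

The Normal Distribution (cf Ch. 17).

\[ \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt, \qquad \varphi(x) = \Phi'(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}. \]

$\varphi^{(\nu)}(x) = (-1)^{\nu} H_\nu(x)\,\varphi(x)$, where $H_\nu(x)$ is the Hermite polynomial of degree $\nu$ (cf
12.6). For negative values of $x$, the functions are calculated from the relations
$\Phi(-x) = 1 - \Phi(x)$, $\varphi(-x) = \varphi(x)$, $\varphi^{(\nu)}(-x) = (-1)^{\nu} \varphi^{(\nu)}(x)$.

Columns, in order: $x$, $\Phi(x)$, $\varphi(x)$, $\varphi'(x)$, $\varphi''(x)$, $\varphi^{(3)}(x)$, $\varphi^{(4)}(x)$, $\varphi^{(5)}(x)$, $\varphi^{(6)}(x)$.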

0.0 

0.50000 

0.89894 

一 0 00000 

— 0.89894 

-f O.ooooo 

+ 1.19688 

— 0 00000 

— 6.98418 

01 

0.589S8 

0.89696 

0.08970 

0.89298 

0.11M9 

1.167Q8 

0 59146 

5.77025 

0 2 

0 67926 

0.89104 

0.07821 

0.87540 

0.28160 

1.079W) 

0 94180 

1.14197 

5.17112 

0 8 

0 01791 

0 S8189 

0.11442 

0.84706 

0.88295 

1 61420 

4.22226 

0.4 

0.05542 

0.86827 

0.14781 

0.80985 

0.41886 

0.76070 

1.97770 

3.01241 

0.5 

0.69146 

0 85207 

0.17608 

0.26406 

0.48409 

0.55010 

2.21141 

1.64481 

06 

0 72675 

0 88822 

0.19998 

0 21826 

0.52788 

0.82809 

2 80517 

— 0.28237 

07 

0 76804 

0 81226 

0.21868 

0 16926 

0.54868 

+ O.OS871 

2.26012 

+ 1-11854 

0.8 

0.78814 

0 28969 

0 28175 

0.10429 

0.64694 

— 0 12168 

2 08800 

2 29382 

0.9 

0 81594 

0 26609 

0.28948 

— 0.05058 

0.52446 

0.82084 

1.80961 

3.28026 

1 0 

0 84184 

0 24197 

0.24197 

0 00000 

0 48894 

0 48894 

1.45182 

3 87158 

1.1 

0 80438 

0.21785 

0 28984 

+ 0.04576 

0.42896 

0. 60909 

1 04580 

4.19585 

1.2 

0 88498 

0.19419 

0 28802 

0.08544 

0.86862 

0.69255 

0 62801 

4 21084 

1.8 

0.9O32O 

0.17187 

0 22278 

0 11824 

0.29184 

0.78418 

— 0 21800 

3 94763 

14 

0.91924 

0 14978 

0. 20962 

0.14374 

0 21800 

0.78642 

+ 0.15897 

3.46958 

15 

0.93819 

0.12952 

0.19428 

0 16190 

0 14571 

0.70425 

0 47855 

2 81094 

1 6 

0.94520 

0 11092 

0 17747 

0.173()4 

0 07809 

0 »4405 

0.71818 

2.07125 

1.7 

0 95548 

0 09405 

0 15988 

0.17775 

-HO 01769 

0 56316 

0.88702 

1.80785 

1.8 

0 96407 

0 07896 

0 14211 

0 17685 

— 0 03411 

0.46915 

0.98090 

+ 0 58014 

1 9 

0.97128 

0.06562 

0.12467 

0 17126 

0.07605 

0 86928 

1.00583 

一 0 06467 

2 0 

0.07725 

0 05899 

0 10798 

0 16197 

0.10798 

0.26996 

0 97184 

0 69390 

2l 

0.98214 

0 04898 

0 09287 

0.14998 

0.18024 

0.17646 

0 89150 

0 98987 

2 2 

0 9S610 

0.03547 

0 07804 

0 13622 

0 14860 

0 09274 

0 77844 

1 24886 

2 8 

0.98928 

0 02888 

0.06515 

0.12152 

0.14920 

— 0.02141 

0 64604 

1 37883 

2.4 

0.99180 

0 02239 

0 05375 

0 10660 

0 14834 

+ 0.03t»28 

0 60642 

1 39654 

2.5 

0.99879 

0 01758 

0 04882 

0 09202 

0 14242 

0 07997 

0 86974 

1 32421 

26 

0.99584 

0.01358 

0 03582 

0 07824 

0 18279 

O.L1053 

0 24876 

1 18645 

2.7 

0 99658 

0 01042 

0 02814 

0 06555 

0 120V1 

0 12926 

0 13381 

1 00761 

2.8 

0.99744 

0 0079*2 

0.02216 

0 05414 

0.10727 

0.18793 

+ 0 04287 

0.80970 

29 

0 99818 

0 00595 

0.01726 

0.04411 

0.09889 

0.18850 

— 0 02810 

0.61102 

3.0 

0.99865 

0 00448 

0 01330 

0.08545 

0.07077 

0.KV296 

0 07977 

0.42546 

3l 

0.99908 

0 00827 

0.01013 

0.02813 

0.06694 

0 12818 

0 11895 

0 2A242 

3.2 

0 99981 

0.00288 

0 00768 

0.02208 

0.05528 

0.11066 

0.13319 

0.12712 

38 

0.99962 

0 00172 

0.00668 

0 01704 

0.04485 

0.09690 

0 14086 

— 0 02130 

3.4 

0 99966 

0.00128 

0.00419 

0.01801 

0 08586 

0.08290 

0.18840 

+ 0 05607 

3.5 

0.99977 

0.00087 

0 00805 

0 00982 

0.02825 

0 06948 

0 18000 

0 10784 

3.6 

0.99984 

0 00061 

0.00220 

0 00782 

0.02194 

0 05708 

0.11765 

0.18802 

3.7 

0.99989 

0. 00042 

0.00157 

0. 00589 

0.01680 

0. 04699 

0.10297 

0.16102 

3.8 

0.99998 

0 00029 

O.oom 

0 00892 

0. 012(19 

0.08646 

0.08777 

0.16124 

3 9 

0.99995 

0.00020 

0.00077 

0.00282 

0.00946 

0. 02842 

0 07802 

0.14264 

4.0 

0.99997 

0.00018 

— 0 00054 

+ 0 00201 

— 0.00696 

+ 0 02181 

一 0 05942 

+ 0.12861 
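The functions tabulated in Table 1 can be recomputed from the relation between the derivatives of the normal frequency function and the Hermite polynomials. The following sketch assumes the recurrence $H_{\nu+1}(x) = x H_\nu(x) - \nu H_{\nu-1}(x)$ with $H_0 = 1$, $H_1 = x$, and expresses the distribution function through the complementary error function of the Python standard library; the function names are hypothetical.

```python
import math

def hermite(nu, x):
    """Hermite polynomial H_nu(x) via H_{k+1} = x H_k - k H_{k-1}, H_0 = 1, H_1 = x."""
    if nu == 0:
        return 1.0
    h_prev, h = 1.0, x
    for k in range(1, nu):
        h_prev, h = h, x * h - k * h_prev
    return h

def phi(x):
    """Normal frequency function."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Normal distribution function, written with erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def phi_derivative(nu, x):
    """nu-th derivative of phi: (-1)^nu H_nu(x) phi(x)."""
    return (-1) ** nu * hermite(nu, x) * phi(x)

# One row of values for a chosen x, in the column order of Table 1.
x = 1.0
row = [x, Phi(x), phi(x)] + [phi_derivative(nu, x) for nu in range(1, 7)]
```

Such a recomputation is a convenient way of checking individual entries of the table.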



TABLE 2. 

The Normal Distribution (cf 17.2). 


The probability that an observed value of a normally distributed variable $\xi$ differs
from the mean $m$ in either direction by more than $\lambda$ times the standard deviation $\sigma$ is

\[ P = P(|\xi - m| > \lambda\sigma) = 2[1 - \Phi(\lambda)] = \frac{2}{\sqrt{2\pi}} \int_{\lambda}^{\infty} e^{-t^2/2}\,dt. \]

The value $\lambda = \lambda_p$ corresponding to $P = p/100$ is called the $p$ percent value of a normal
deviate.

  $\lambda_p$ as a function of $p$        $p$ as a function of $\lambda_p$

  p = 100 P    lambda_p             lambda_p    p = 100 P
    100         0.0000                0.0        100.000
     95         0.0627                0.2         84.148
     90         0.1257                0.4         68.916
     85         0.1891                0.6         54.851
     80         0.2533                0.8         42.371
     75         0.3186                1.0         31.731
     70         0.3853                1.2         23.014
     65         0.4538                1.4         16.151
     60         0.5244                1.6         10.960
     55         0.5978                1.8          7.186
     50         0.6745                2.0          4.550
     45         0.7554                2.2          2.781
     40         0.8416                2.4          1.640
     35         0.9346                2.6          0.932
     30         1.0364                2.8          0.511
     25         1.1503                3.0          0.270
     20         1.2816                3.2          0.137
     15         1.4395                3.4          0.067
     10         1.6449                3.6          0.032
      5         1.9600                3.8          0.014
      1         2.5758                4.0          0.006
      0.1       3.2905
      0.01      3.8906
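The two halves of Table 2 are inverse to each other and can be recomputed from $P = 2[1 - \Phi(\lambda)]$, which equals the complementary error function of $\lambda/\sqrt{2}$. The sketch below uses that identity for $p$ as a function of $\lambda$, and a simple bisection (an assumed method, chosen here for brevity) for $\lambda_p$ as a function of $p$; the function names are hypothetical.

```python
import math

def two_sided_percent(lam):
    """p = 100 P(|xi - m| > lam * sigma) for a normally distributed variable."""
    return 100.0 * math.erfc(lam / math.sqrt(2.0))

def percent_value(p, lo=0.0, hi=10.0, iterations=100):
    """p percent value lambda_p of a normal deviate, found by bisection.

    Assumes 0 < p <= 100; two_sided_percent decreases steadily from 100 at lam = 0.
    """
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if two_sided_percent(mid) > p:
            lo = mid          # tail still too heavy: lambda_p lies further out
        else:
            hi = mid
    return 0.5 * (lo + hi)

p_at_2 = two_sided_percent(2.0)   # compare with the entry for lambda_p = 2.0
lam_5 = percent_value(5.0)        # compare with the entry for p = 5
```

The same two functions give values between the tabulated arguments, for which the table would otherwise require interpolation.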







TABLE 3. 

The $\chi^2$-Distribution (cf 18.1).


The fr. f. $k_n(x)$ of the $\chi^2$-distribution with $n$ degrees of freedom is defined by (18.1.3).
The $p$ percent value $\chi_p^2$ of $\chi^2$ for $n$ d. of fr. is a value such that the probability $P$ that
an observed value of $\chi^2$ exceeds $\chi_p^2$ is

\[ P = \frac{p}{100} = P(\chi^2 > \chi_p^2) = \int_{\chi_p^2}^{\infty} k_n(x)\,dx. \]

By the kind permission of Prof. R. A. Fisher and Messrs Oliver and Boyd, the table
is reprinted from R. A. Fisher, Ref. 13.


$\chi_p^2$ as a function of $n$ and $p = 100 P$

Columns, in order: degrees of freedom $n$, then $p$ = 99, 98, 95, 90, 80, 70, 50, 30, 20, 10, 5, 2, 1, 0.1.

1 

O.ooo 

0.001 

0.004 

0 016 

0.064 

0.148 

0.465 

1.074 

l.«42 

2.706 

3.841 

6.412 

6.635 

10.827 

2 

0.020 

0.040 

0.108 

0 211 

0.446 

0.718 

1.8M 

2.408 

3.219 

4.A06 

6.991 

7.824 

9.210(13.816 

3 

0.115 

0.186 

0.862 

0.684 

1 006 

1.424 

2 m 

3.66 & 

4.042 

6.251 

7.815 

9.837 

11 $4l|l6.2«8 

4 

0.297 

0.429 

0.711 

1.064 

1 649 

2.196 

3 857 

4.878 

5.989 

7.779 

9.488 

11.M8 

13 277)18.466 

5 

0.554 

0.752 

1.146 

1.610 

2.848 

3.000 

4.»61 

6.064 

7 289 

9.2M 

11.070 

13.888 

16 086 

20.517 

6 

0.872 

1.184 

1.685 

2.204 

3.070 

3.828 

6.MS 

7.281 

8.558 

10.645 

12.592 

15.088 

16 812 

22.467 

7 

1.289 

1.564 

2.187 

2.888 

3.822 

4.«71 

6.846 

S.S88 

9.808 

12.017 

14.067 

16.632 

18.475 

24.822 

8 

1.646 

2.082 

2.788 

3.490 

4.594 

6.527 

7.M4 

9.524* 

11.080 

13.862 

16.607 

18.168 

20 090 

26.126 

9 

2.088 

2.582 

3.826 

4.168 

6.M0 

6.89S 

8.84S 

10.666 

12.242 

14.684 

18.919 

19.679 

21.666 

27.877 

10 

2 658 

3.059 

3.940 

4.866 

0 179 

7.267 

9.842 

11.781 

13 442 

16 987 

18.807 

21.161 

23.209 

29.688 

11 

3.059 

3.009 

4.575 

6.578 

6.989 

8148 

10S41 

12.S99 

14.681 

17.276 

19.876 

22.618 

24.726 

31.264 

12 

3.571 

4.178 

6.226 

6.804 

7.807 

9.084 

11.M0 

14.011 

16.812 

18.649 

21.026 

24.064 

巧 .217 

32.909 

18 

4.107 

4.766 

5.892 

7.042 

8.684 

9.92« 

12.840 

16.119 

16.985 

19.812 

22.M2 

25.472 

27.688 

84.538 

14 

4.660 

6.868 

6.571 

7.790 

9.467 

10.821 

18.U9 

16.222 

18.151 

21 064 

28.086 

拥 .878 

20.141 

36.128 

16 

5.229 

6.985 

7.261 

8.547 

10.807 

11.721 

14.889 

17.822 

19.811 

22.807 

24.M6 

28.269 

80.578 

87.897 

16 

5.812 

6.814 

7.9«2 

9.312 

11.152 

12.624 

1&.88S 

18.418 

20.465 

23.542 

26.2M 

29 m 

32 ooo 

39.262 

17 

6.408 

7.255 

8.672 

10.086 

12.002 

13.681 

16.888 

19.511 

21 615 

24.789 

27.687 

30.995 

33.409 

40.790 

18 

7.016 

7.906 

9.890 

10.866 

12.867 

14.440 

17.8M 

20.601 

22 760 

25.989 

28.889 

32.S46 

34.805 

42.812 

19 

7.«S8 

8.M7 

10.117 

11.661 

13.718 

16.862 

18.U8 

21.689 

23.900 

27.204 

30.144 

33 887 

30 191 

4S.820 

20 

8.260 

9.287 

10.861 

12.448 

14.678 

10.2C6 

19.887 

22.775 

26.088 

28.412 

31.410 

35.020 

37.664 

46.S15 

21 

8.897 

9.915 

11.591 

13.240 

16.445 

17.182 

20.887 

23.868 

26.171 

29 01S 

32.671 

36.M8 

38.982146 797 

22. 

9.542 

10.600 

12.888 

14.041 

10.814 

18.101 

21.887 

24.989 

27.801 

30.818 

33.924 

37.659 

40.28948.268 

23 

10.196 

11.298 

13.091 

14.848 

17.187 

19.021 

22.887 

26.018 

28.429 

32.007 

35.172 

38.968 

41.688 

49.728 

24 

10.866 

11.992 

13.848 

15.659 

18.062 

10.948 

23.887 

27.096 

29.651 

33.196 

36.415 

40.270 

42.980 

51.179 

26 

11.524 

12.697 

14.611 

16.478 

18.940 

20.867 

24.U7 

28.172 

30.675 

34.882 

37.662 

41.666 

44.814 

62.620 

26 

12.198 

13.409 

15.379 

17.292 

19.820 

21.792 

26 N6 

29.246 

31.795 

35.668 '38.886 

42.856 

46.642 

64.052 

21 

12.879 

14.128 

16.151 

18.114 

20.708 

22.719 

26.816 

30.819 

32.912 

36.741 

40.118 

44.140 

46.968 

55.47ft 

28 

13.666 

14.847 

16.928 

18.939 

21 588 

23.847 

27.886 

31.891 

34.027 

37.916 

41.887 

46.419 

48.278 

66.893 

29 

14.266 

16.574 

17.708 

19.768 

22.476 

24.577 

28.886 

32.461 

35.189 

39.087 142.557 

46698 

49.58« 

)68 802 

30 

14.958 

16.806 

18.498 

20.699 

23.864 

26.508 

29.886 

33.680 

30.250 1 40.266 |43.773 

47.962 

50.892 

69.708 







TABLE 4. 

The $t$-Distribution (cf 18.2).

The fr. f. $s_n(x)$ of the $t$-distribution with $n$ degrees of freedom is defined by (18.2.4).
The $p$ percent value $t_p$ of $t$ for $n$ d. of fr. is a value such that the probability $P$
that an observed value of $t$ differs from zero in either direction by more than $t_p$ is

\[ P = \frac{p}{100} = P(|t| > t_p) = 2 \int_{t_p}^{\infty} s_n(x)\,dx. \]

By the kind permission of Prof. R. A. Fisher and Messrs Oliver and Boyd, the table
is reprinted from R. A. Fisher, Ref. 13.


$t_p$ as a function of $n$ and $p = 100 P$

Columns, in order: degrees of freedom $n$, then $p$ = 90, 80, 70, 60, 50, 40, 30, 20, 10, 5, 2, 1, 0.1.

1 

0.158 

0.825 

0.610 

0.727 

1.000 

1.876 

1.9C8 

3.078 

6.814 

12.706 

I 

31.821 | 63.067 

636.619 

2 

0.142 

0.289 

0.446 

0.617 

0.816 

1.061 

1.886 

1.8M 

2.920 

4.808 

6.965 

9.925 

31.698 

3 

0.187 

0.277 

0.434 

0.584 

0.786 

0»78 

1.250 

1 688 

2.868 

3.182 

4.541 

5.841 

12.941 

4 

0.1M 

0.271 

0.414 

0.M9 

0.741 

0 041 

1.190 

1.588 

2.182 

2.776 

3.747 

4.604 

8.610 

5 

0.182 

0.2«7 

0.4O8 

0.569 

0.7J7 

0.920 

1.1W 

1.476 

2.016 

2.571 

3 866 

4.082 

6.869 

6 

0.181 

0.266 

0.4O4 

0.M8 

0.718 

0.906 

1 184 

1 440 

1.948 

2.447 

3.148 

8.707 

6.959 

7 

0.180 

0.2M 

0.4O2 

0.549 

0.711 

0.896 

1.11« 

1.415 

1.895 

2 866 

2.998 

3.499 

5.406 

8 

0 180 

0.262 

0.899 

0.M8 

0.706 

0.889 

1.108 

1.897 

1-WO 

2.806 

2.890 

3 8&& 

6.041 

9 

0.12V 

0.W1 

0.898 

0.648 

0.708 

0.888 

1.100 

1.888 

1 888 

2.262 

2.821 

3.250 

4.781 

10 

0.129 

0.260 

0.897 

0.642 

0.700 

0.879 

1.098 

1.872 

1.812 

2.228 

2.764 

3.169 

4.687 

11 

0.129 

0.W0 

0 890 

0.MO 

0.C97 

0.876 

1.088 

1.808 

1 796 

2 201 

2.718 

3.106 

4.487 

12 

0.128 

0.259 

0.M6 

0.680 

0.695 

0.878 

1.088 

1.858 

3.782 

2.179 

2.681 

3 055 

4.818 

IS 

0.128 

0 269 

0.894 

0.6S8 

0.AM 

0.870 

1.079 

1 S50 

1.771 

2.160 

2.660 

3 012 

4.221 

14 

0.128 

0.268 

0.898 

0.687 

0.692 

0.868 

1.076 

1.845 

1.7«1 

2 146 

2.824 

2.977 

4.140 

16 

0.128 

0.258 

0.898 

0.6M 

0.C91 

0.868 

1 074 

1.841 

1.758 

2.181 

2.804 

2.947 

4.078 

16 

0.128 

0.258 

0.892 

0.586 

0.C9O 

0.865 

1.071 

1.887 

1.746 

2.120 

2.588 

2.921 

4.016 

17 

0.128 

0.267 

0.S92 

0.684 

0 689 

0.8M 

1.069 

1.8d3 

1.740 

2.110 

2.8C7 

2.898 

3.9fl5 

18 

0.127 

0.257 

0.892 

0.684 

0.688 

0.862 

1.067 

1.880 

1 784 

2.101 

2.&&2 

2.878 

3.922 

19 

0127 

0.267 

O.m 

0.588 

0.688 

0 861 

1.006 

1.828 

1.729 

2 098 

2.689 

2.881 

3.888 

20 

0.127 

0.267 

O.m 

0.68S 

0.M7 

0.860 

1.064 

1.825 

1.726 

2.086 

2.528 

2.846 

3.860 

21 

0.127 

0.S57 

0.891 

0.682 

0.680 

0.860 

1.0M 

1.828 

1.721 

2.080 

2.618 

2.881 

3.819 

22 

0.127 

0.368 

0.M0 

0.M8 

0.686 

0.868 

1.061 

1.821 

1.717 

2.074 

2.508 

2.819 

8.792 

23 

0.127 

0.366 

0.890 

0.US 

0.685 

0.868 

1.060 

1.819 

1.714 

2.060 

2.500 

2 807 

3.767 

24 

0.127 

0.2M 

0.890 

0.681 

0.M6 

0.867 

1.059 

1.818 

1.711 

2.0A4 

2.402 

2.797 

3.745 

26 

0.127 

0.2M 

0.890 

0.M1 

0.684 

0.8M 

1.068 

1 816 

1.708 

2.060 

2.466 

2.787 

3.726 

26 

0.127 

0.2M 

0.800 

0.681 

0.584 

0.8M 

1.068 

1.815 

1.706 

2.056 

2.479 

2.779 

3.7V7 

27 

0.127 

0.2M 

0.889 

0.681 

0.084 

0.85A 

1.057 

1.814 

1.70B 

2.0&2 

2.478 

2.771 

3.690 

28 

0.127 

0.250 

0.189 

0.M0 

0.M8 

0.8^5 

1.056 

1.818 

1.701 

2.048 

2.«7 

2.768 

3.674 

29 

0.127 

0 2M 

0.889 

0.580 

0.68S 

0.864 

l.OW 

1.811 

1.699 

2.04 & 

2.462 

2.7M 

3.069 

80 

0.127 

0.260 

0.889 

0.680 

0.688 

0.8M 

1.0&6 

l.aio 

1.697 

2.043 

2.467 

2.750 

8.540 

40 

0.128 

0.26fi 

0.888 

0.629 

0.M1 

0.861 

1 060 

1.808 

1.084 

2.021 

2.428 

2.704 

3.661 

60 

0.186 

0.254 

0.887 

0.627 

0.879 

0 .撕 

1.04« 

1.296 

1.671 

2.000 

2 800 

2.000 

8.400 

120 

0.1M 

0.264 

0.8M 

0.626 

0.614 

0.677 

0.846 

1.041 

1.280 

1.668 

1.980 

2.8M 

2.617 

8.878 

00 

0.126 

0.258 

O.Mfi 

0.C74 

0.842 

1.0M 

1.282 

1.646 

1.960 

2.tM 

2.676 

8.201 














List of References. 


I. Books. 

1. Aitken, A. C. Determinants and Matrices. University Mathematical Texts,
   1. Third ed., Edinburgh and London 1944.

2. - Statistical Mathematics. University Mathematical Texts, 2. Third
   ed., Edinburgh and London 1944.

3. Bôcher, M. Introduction to higher Algebra. New York 1908 (German
   ed. Leipzig 1910).

4. Bochner, S. Vorlesungen über Fouriersche Integrale. Leipzig 1932.

5. Bohr, H., and Mollerup, J. Lærebog i matematisk Analyse, I-IV. Second
   ed., København 1938-1942.

6. Borel, E. Leçons sur la théorie des fonctions. Second ed., Paris 1914.

7. Borel, E., and others. Traité du calcul des probabilités et de ses
   applications. Paris, from 1924.

8. Bonnier, G., and Tedin, O. Biologisk variationsanalys. Stockholm 1940.

9. Charlier, C. V. L. Vorlesungen über die Grundzüge der mathematischen
   Statistik. Lund 1931.

9a. - Application de la théorie des probabilités à l'astronomie. Forms Vol.
   II, Part IV of Ref. 7. Paris 1931.

10. Cramér, H. Sannolikhetskalkylen och några av dess användningar.
   Stockholm 1927.

11. - Random Variables and Probability Distributions. Cambridge Tracts
   in Mathematics, No. 36. Cambridge 1937.

12. Elderton, W. P. Frequency Curves and Correlation. Third ed., Cambridge 1938.

13. Fisher, R. A. Statistical Methods for Research Workers. Eighth ed.,
   Edinburgh and London 1941.

14. - The Design of Experiments. Second ed., Edinburgh and London 1937.

15. Fréchet, M. Recherches théoriques modernes sur la théorie des
   probabilités. Forms Vol. I, Part III of Ref. 7. Paris 1937-1938.

16. Gauss, C. F. Werke, Vol. 4, Göttingen 1880.

17. Hobson, E. W. The Theory of Functions of a Real Variable, I-II.
   Cambridge 1926-1927.

18. Jeffreys, H. Theory of Probability. Oxford 1939.

19. Kendall, M. G. The Advanced Theory of Statistics, I. London 1943.

    - See Yule, G. U.

20. Keynes, J. M. A Treatise on Probability. London 1921.

21. Kolmogoroff, A. Grundbegriffe der Wahrscheinlichkeitsrechnung.
   Berlin 1933.

22. Laplace, P. S. Théorie analytique des probabilités. Paris, first ed. 1812,
   second ed. 1814, third ed. 1820.

23. Lebesgue, H. Leçons sur l'intégration et la recherche des fonctions
   primitives. Second ed., Paris 1928.





24. L^vy, P. Calcul des probabilit^s. Paris 1925. 

25. - Th^orie de 1'addition des variables al^atoires. Paris 1937. 

20. Lundboro, H., and Linders, F. J. The Racial Characters of the Swedish 
Nation. Uppsala 1926. 

27. Mises, R. v. Wahrscheinhchkeitsrechnung und ihre Anwendung in der 

Statistik und theoretischen Physik. Leipzig — Wien 1931. 

28. - Wahrscheinlichkeit, Statistik und Wahrheit. Second ed.，Wien 1936. 

29. Moivre, A. de. Miscellanea Analytica. Second suppl. 1733. 

Mollerup, J.，see Bohr, H. 

30. Neyman, J. Lectures and Conferences on Mathematical Statistics. Wash 

mgton 1938. 

31. PoincarA, H. Calcul des probability. Second ed. t Paris 1912. 

32. Poisson, S. D. Recherches sur la probability des jugements etc. Paris 1837. 

33. Saks, S. Theory of the Integral. Second ed.，Warszawa 1937. 

34. Schultz, H. The Theory and Measurement of Demand. Chicago 1938. 

35. Snedecor, O. W. Statistical Methods. Ames, Iowa, 1940. 

35a. Spearman, C. The Abilities of Man. London 1927. 

36. Szego, G. Orthogonal Polynomials. New York 1939. 

Tedin, O., st>e Bonnier, G. 

37. Thiele, T. N. Theory of Observations. London 1903. 

37a. Thomson, G. H. The Factorial Analysis of Human Ability. London 1939. 

38. Titchmarsh, E. C. Introduction to the Theory of Fourier Integrals. Oxford 

1937. 

39. Uspensky, J. V. Introduction to Mathematical Probability. New York
1937. 

40. Vallée Poussin, C. de la. Intégrales de Lebesgue, fonctions d'ensembles,

classes de Baire. Second ed., Paris 1934.

41. Wiener, N. The Fourier Integral and certain of its Applications.

Cambridge 1933.

42. Wilks, S. S. The Theory of Statistical Inference. Ann Arbor 1937. 

43. Yule, G. U., and Kendall, M. G. An Introduction to the Theory of

Statistics. Twelfth ed., London 1940.

II. Papers. 

Abbreviations. 

AE    Annals of Eugenics.

AMS   Annals of Mathematical Statistics.

B     Biometrika.

CR    Comptes Rendus de l'Académie des Sciences, Paris.

JRS   Journal of the Royal Statistical Society.

MA    Mathematische Annalen.

M     Metron.

PCPS  Proceedings of the Cambridge Philosophical Society.

PRS   Proceedings of the Royal Society, London, A.

PTRS  Philosophical Transactions of the Royal Society, London, A.

SA    Skandinavisk Aktuarietidskrift.

TAMS  Transactions of the American Mathematical Society.





50. Aitken, A. C. On the graduation of data by the orthogonal polynomials
of least squares. Proc. R. Soc. Edinburgh, 53 (1933), p. 54.

53. Arley, N. On the distribution of relative errors from a normal population
of errors. K. Danske Vid. Selsk. Mat.-fys. Medd. 18 no. 3 (1940).
Babington Smith, B., see Kendall, M. G.

54. Bartlett, M. S. On the theory of statistical regression. Proc. R. Soc.
Edinburgh, 53 (1933) p. 260.

55. - The information available in small samples. PCPS 32 (1936) p. 560. 

56. - Complete simultaneous fiducial distributions. AMS 10 (1939) p. 129.

- see Wishart, J.

60. Behrens, W. U. Ein Beitrag zur Fehlerberechnung bei wenigen

Beobachtungen. Landwirtsch. Jahrbücher, 68 (1929) p. 807.

61. Berg, G. The prognosis of open pulmonary tuberculosis. Acta Tuberculosea

Scand., Suppl. IV (1939).

62. Bergström, H. On the central limit theorem. SA 1944, p. 139, and 1945,
p. 106. 

63. Bernstein, S. Sur l'extension du théorème limite du calcul des

probabilités aux sommes de quantités dépendantes. MA 97 (1927) p. 1.

63a. Bortkiewicz, L. v. Das Gesetz der kleinen Zahlen. Leipzig 1898.

64. Cantelli, F. P. La tendenza ad un limite nel senso del calcolo delle

probabilità. Rend. Circ. Mat. Palermo, 16 (1916) p. 191.

65. Charlier, C. V. L. Researches into the theory of probability. K. Fysiogr.

Sällsk. Handl. B 16 (1906).

Clopper, C. J., see Pearson, E. S.

66. Cochran, W. G. The distribution of quadratic forms in a normal system, 

with applications to the analysis of covariance. PCPS 30 (1933 — 34) 
p. 178. 

Corbet, A. S., see Fisher, R. A.

67. Craig, C. C. An application of Thiele's semi-invariants to the sampling

problem. M 7 no. 4 (1928) p. 3. 

68. Cramér, H. Sur quelques points du calcul des probabilités. Proc. London

Math. Soc., 23 (1925) p. lviii.

69. - On some classes of series used in mathematical statistics. Sixth

Scandinavian Congr. of Math., København 1925.

70. - On the composition of elementary errors. SA 1928, p. 13 and p. 141. 

71. - On the representation of a function by certain Fourier integrals. TAMS 

40 (1939) p. 191. 

72. - Contributions to the theory of statistical estimation. SA 1946. 

74. Darmois, G. Sur les lois de probabilité à estimation exhaustive. CR 200

(1935) p. 1265. 

David, F. N., see Neyman, J. 

Davies, O. L., see Pearson, E. S. 

75. Doob, J. L. Probability as measure. AMS, 12 (1941) p. 206. 

78. Dugué, D. Application des propriétés de la limite au sens du calcul des
probabilités à l'étude de diverses questions d'estimation. Journ. de
l'Éc. Polytechn. 1937, p. 305.

80. Edgeworth, F. Y. The law of error. PCPS 20 (1905) p. 36.




81. Eggenberger, F. Die Wahrscheinlichkeitsansteckung. Mitt. d. Ver.

schweizerischer Vers.-Math., 1924, p. 31.

81a. Eneroth, O. Om frömängden vid fläcksådd samt om sambandet mellan
plantantal och slutenhetsgrad vid självsådd. Norrl. Skogsv.-förb. Tidskr.
1945, p. 161.

82. Esscher, F. On graduation according to the method of least squares by 

means of certain polynomials. Försäkr.-A.-B. Skandias Festskr. 1930,
II, p. 107. 

83. Esseen, C. G. Fourier analysis of distribution functions. A mathematical

study of the Laplace-Gaussian law. Acta Math. 77 (1944) p. 1. 

Essen-Moller, E., see Quensel, C. E. 

84. Feller, W. Sur les axiomatiques du calcul des probabilités et leurs

relations avec les expériences. Actualités scientifiques et industrielles, no.
736 (1938) p. 7.

85. - Über den zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung.

Math. Zeitschr. 40 (1935) p. 521. 

86. - Über den zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung,

II. Math. Zeitschr. 42 (1937) p. 301.

87. Fisher, R. A. On an absolute criterion for fitting frequency curves. Mess. 

of Math., 41 (1912) p. 155.

88. - Frequency distribution of the values of the correlation coefficient in

samples from an indefinitely large population. B, 10 (1915) p. 507. 

89. - On the mathematical foundations of theoretical statistics. PTRS, 222 

(1921) p. 309. 

90. - On the »probable error» of a coefficient of correlation deduced from a

small sample. M, 1 no. 4 (1921) p. 1. 

91. - On the interpretation of χ² from contingency tables, and the calculation

of P. JRS, 85 (1922) p. 87. 

92. - The goodness of fit of regression formulae, and the distribution of

regression coefficients. JRS, 85 (1922) p. 597.

93. - The distribution of the partial correlation coefficient. M, 3 (1924)

p. 329. 

94. - On a distribution yielding the error functions of several well-known 

statistics. Proc. Intern. Math. Congr. Toronto 1924, p. 805. 

95. - The conditions under which χ² measures the discrepancy between

observation and hypothesis. JRS, 87 (1924) p. 442.

96. - Theory of statistical estimation. PCPS, 22 (1925) p. 700. 

97. - Applications of Student’s distribution. M, 5 no. 3 (1925) p. 90. 

98. - The general sampling distribution of the multiple correlation

coefficient. PRS, 121 (1928) p. 654.

99. - Moments and product moments of sampling distributions. Proc.

London Math. Soc., 30 (1929) p. 199.

100. - Inverse probability. PCPS, 26 (1930) p. 528. 

101. - The moments of the distribution for normal samples of measures of

departure from normality. PRS, 130 (1930) p. 16.

102. Fisher, R. A. The concepts of inverse probability and fiducial probability 

referring to unknown parameters. PRS, 139 (1933) p. 343. 






103. Fisher, R. A. Two new properties of mathematical likelihood. PRS, 

144 (1934) p. 285. 

104. - The logic of inductive inference. JRS, 98 (1935) p. 39. 

105. - The fiducial argument in statistical inference. AE, 6 (1935) p. 391. 

106. - On a point raised by M. S. Bartlett on fiducial probability. AE, 7 

(1937) p. 370. 

107. - The comparison of samples with possibly unequal variances. AE, 9 

(1939) p. 174. 

108. - A note on fiducial inference. AMS, 10 (1940) p. 383. 

109. - The asymptotic approach to Behrens' integral with further tables

for the d test of significance. AE, 11 (1941). 

110. Fisher, R. A., and Tippett, L. H. C. Limiting forms of the frequency

distribution of the largest or smallest member of a sample. PCPS, 24
(1928) p. 180. 

111. Fisher, R. A., Corbet, A. S., and Williams, C. B. The relation between

the number of species and the number of individuals in a random 
sample of an animal population. J. of Animal Ecology, 12 (1943) p. 42. 

112. Fréchet, M. Sur la convergence »en probabilité». M, 8 no. 4 (1930) p. 3.

113. Frisch, R. Correlation and scatter in statistical variables. Nord. Statist.

Tidskr., 8 (1929) p. 36.

114. - Statistical confluence analysis by means of complete regression systems.

Oslo 1934. 

115. Geary, R. C. Distribution of Student's ratio for non-normal samples. JRS,

Suppl. 3 (1936). 

116. - A general expression for the moments of certain symmetrical functions

of normal samples. B, 25 (1933) p. 184. 

116a. - Comparison of the concepts of efficiency and closeness for consistent

estimates of a parameter. B, 33 (1944) p. 123. 

117. Gnedenko, B. Sur les fonctions caractéristiques. Bull. Univ. Moscou, A 1,

no. 5 (1937). 

118. Gram, J. P. Om Rækkeudviklinger bestemte ved Hjælp av de mindste

Kvadraters Methode. København 1879.

119. Greenwood, M., and Yule, G. U. An inquiry into the nature of frequency

distributions representative of multiple happenings with particular
reference to the occurrence of multiple attacks of disease or of repeated 
accidents. JRS, 83 (1920) p. 255. 

120. Gumbel, E. J. Les valeurs extrêmes des distributions statistiques. Ann.

Inst. Henri Poincaré, 5 (1936) p. 115.

121. Hagstroem, K. G. La loi de Pareto et la réassurance. SA, 1925, p. 65.

121a. - Alcune formule appartenenti alla statistica rappresentativa. Giorn.

Ist. Italiano d. Attuari, 3 (1932) p. 147.

122. - Inkomstutjämningen i Sverige. Skand. Bankens Kvart.-skr., April

1944.

Hald, A., see Rasch, G.

123. Haldane, J. B. S. The mean and variance of χ², when used as a test of

homogeneity, when expectations are small. B, 31 (1940) p. 346.

124. Hartley, H. O. The range in normal samples. B, 32 (1942) p. 334.

- see Pearson, E. S. 





125. Helmert, F. R. Über die Wahrscheinlichkeit von Potenzsummen der

Beobachtungsfehler etc. Z. f. Math. u. Phys. 21 (1876). 

126. Hotelling, H. The generalization of Student's ratio. AMS, 2 (1931) p. 360.

127. Hsu, P. L. Contribution to the theory of Student's t-test as applied to the

problem of two samples. Statist. Research Mem., 2 (1938) p. 1.

128. Hultkrantz, J. V. Über die Zunahme der Körpergrösse in Schweden in

den Jahren 1840–1926. Uppsala 1927.

130. Ingham, A. E. An integral which occurs in statistics. PCPS, 29 (1932) p. 271.

131. Irwin, J. O. The further theory of Francis Galton's individual difference

problem. B, 17 (1925) p. 100. 

132. - On the frequency distribution of the means of samples from

populations of certain of Pearson's types. M, 8 no. 4 (1930) p. 51.

133. - Mathematical theorems involved in the analysis of variance. JRS, 94

(1931) p. 284. 

134. Jordan, C. Approximation and graduation according to the principle of 

least squares by orthogonal polynomials. AMS, 3 (1932) p. 257. 

135. Kapteyn, J. C. Skew frequency curves in biology and statistics. Groningen

1903 and 1916. 

136. Kendall, M. G. The conditions under which Sheppard's corrections are

valid. JRS, 101 (1938) p. 692. 

137. - and Babington Smith, B. Randomness and random sampling

numbers. JRS, 101 (1938) p. 147, and JRS Suppl., 6 (1939) p. 51.

139. Khintchine, A. Sur la loi des grands nombres. CR, 188 (1929) p. 477. 

140. - Sul dominio di attrazione della legge di Gauss. Giorn. Ist. Italiano d.

Attuari, 6 (1935) p. 378.

140a. Kolodziejczyk, S. On an important class of statistical hypotheses. B, 27 
(1935) p. 161. 

141. Koopman, B. O. On distributions admitting a sufficient statistic. TAMS,

39 (1936) p. 399. 

142. Koopmans, T. Linear regression analysis of economic time series.

Netherlands Econ. Inst., Haarlem 1937.

143. Kullback, S. An application of characteristic functions to the distribution 

problem of statistics. AMS, 5 (1934) p. 263.

144. Langdon, W. H., and Ore, O. Semi-invariants and Sheppard's correction.

Annals of Math., 31 (1930) p. 230.

145. Lévy, P. Propriétés asymptotiques des sommes de variables aléatoires

indépendantes ou enchaînées. Journ. Math. pures appl., 14 (1935)
p. 347.

146. Liapounoff, A. Sur une proposition de la théorie des probabilités. Bull.

Acad. Sc. St-Pétersbourg, 13 (1900) p. 359.

147. - Nouvelle forme du théorème sur la limite de probabilité. Mém. Acad.

Sc. St-Pétersbourg, 12 (1901) no. 5.

148. Lindeberg, J. W. Eine neue Herleitung des Exponentialgesetzes in der

Wahrscheinlichkeitsrechnung. Math. Zeitschr., 15 (1922) p. 211.

149. Lindquist, R. A treatise on reliable predictions of water conditions.

Stockholm 1932.

150. Lukacs, E. A characterization of the normal distribution. AMS, 13 (1942) 

p. 91. 





151. Lukomski, J. On some properties of multidimensional distributions. AMS,

10 (1939) p. 236.

152. Lundberg, O. On random processes and their application to sickness

and accident statistics. Dissert. Stockholm, Uppsala 1940. 

154. Madow, W. G. Limiting distributions of quadratic and bilinear forms. AMS, 

11 (1940) p. 125. 

155. Mendel, G. Versuche mit Pflanzenhybriden. Verhandl. naturforsch. Ver.

Brünn, 4 (1865).

156. Mises, R. v. Grundlagen der Wahrscheinlichkeitsrechnung. Math. Zeitschr.,

4 (1919) p. 1.

157. - Deux nouveaux théorèmes de limite dans le calcul des probabilités.

Revue Fac. Sc. Istanbul, 1 (1935) p. 61.

158. - Les lois de probabilité pour les fonctions statistiques. Ann. Inst. Henri

Poincaré, 6 (1936) p. 185.

159. - On the foundations of probability and statistics. AMS, 12 (1941)
p. 191.



159a. Newbold, E. Practical applications of the statistics of repeated events, 
particularly to industrial accidents. JRS, 90 (1927) p. 487. 

160. Neyman, J. Contributions to the theory of small samples drawn from a
finite population. B, 17 (1925) p. 472.

161. - On the two different aspects of the representative method: the method 

of stratified sampling and the method of purposive selection. JRS, 
97 (1934) p. 558. 

162. - Su un teorema concernente le cosiddette statistiche sufficienti. Giorn.

Ist. Italiano d. Attuari, 6 (1935) p. 320.

163. - On the problem of confidence intervals. AMS, 6 (1935) p. 111.

164. - »Smooth test» for goodness of fit. SA, 1937, p. 149. 

165. - Outline of a theory of statistical estimation based on the classical

theory of probability. PTRS, 236 (1937) p. 333.

166. - L'estimation statistique traitée comme un problème classique de

probabilité. Actualités scientifiques et industrielles, no. 739 (1938)
p. 25.

167. - Fiducial argument and the theory of confidence intervals. B, 32 (1941) 

p. 128. 

168. - Basic ideas and some recent results of the theory of testing statistical

hypotheses. JRS, 105 (1942) p. 292. 

169. Neyman, J., and David, F. N. Extension of the Markoff theorem on least 

squares. Statist. Research. Mem., 2 (1938) p. 105. 

170. Neyman, J., and Pearson, E. S. On the use and interpretation of certain

test criteria for purposes of statistical inference. B, 20 A (1928) p.
175 and p. 263.

171. Neyman, J., and Pearson, E. S. On the problem of the most efficient tests

of statistical hypotheses. PTRS, 231 (1933) p. 289.

172. - Contributions to the theory of testing statistical hypotheses. Statist.

Research Mem., 1 (1936) p. 1, and 2 (1938) p. 25. 

173. - Sufficient statistics and uniformly most powerful tests of statistical 

hypotheses. Statist. Research Mem., 1 (1936) p. 113.
Ore, O., see Langdon, W. H.





180. Pearson, K. Contributions to the mathematical theory of evolution.

PTRS, 185 (1894) p. 71. 

181. - Contributions etc., II: Skew variation in homogeneous material. PTRS,

186 (1895) p. 343. 

182. - Contributions etc., IV: On the probable errors of frequency constants

and on the influence of random selection on variation and correlation. 
PTRS, 191 (1898) p. 220.

183. - On the criterion that a given system of deviations from the probable 

in the case of a correlated system of variables is such that it can be 
reasonably supposed to have arisen from random sampling. Phil. 
Mag. V, 50 (1900) p. 157.

183a. - On lines and planes of closest fit to systems of points in space. Phil. 

Mag. VI, 2 (1901) p. 559. 

184. - On the systematic fitting of curves to observations and measure¬ 

ments. B, 1 (1902) p. 265, and 2 (1902) p. 1. 

185. - Researches on the mode of distribution of the constants of samples 

taken at random from a bivariate normal population. PRS, 112 (1926) 
p. 1. 

186. - On the probability that two independent distributions of frequency 

are really samples from the same population. B, 8 (1911) p. 250. 

187. Pearson, K., and Tocher, J. F. On criteria for the existence of differential 

death-rates. B, 11 (1916) p. 159. 

190. Pearson, E. S. The distribution of frequency constants in small samples 

from non-normal symmetrical and skew populations. B, 20 A (1928) 
p. 356. 

191. - The probability integral transformation for testing goodness of fit 

and combining independent tests of significance. B, 30 (1938) p. 134.
- See also Neyman, J. 

195. Pearson, E. S., and Clopper, C. J. The use of confidence or fiducial limits 
illustrated in the case of the binomial. B, 26 (1934) p. 404.

196. Pearson, E. S., and Davies, O. L. Methods of estimating from samples
the population standard deviation. JRS Suppl., 1 (1934) p. 76. 

197. Pearson, E. S., and Hartley, H. O. The probability integral of the range 

in samples of n observations from a normal population. B, 32 (1942)
p. 301. 

198. Pitman, E. J. G. The »closest» estimates of statistical parameters. PCPS,

33 (1937) p. 212. 

199. - The estimation of the location and scale parameters of a continuous 

population of any given form. B, 30 (1939) p. 391. 

200. Quensel, C. E. The distributions of the second moment and of the

correlation coefficient in samples from populations of type A. Kungl. Fysiogr.
Sällsk. Handl., 49 (1938) no. 4.

201. - Inkomstfördelning och skattetryck. Publ. by Sveriges

Industriförbund, Stockholm 1944.

202. - On the logarithmico-normal distribution. SA, 1945, p. 141.

203. Quensel, C. E., and Essen-Möller, E. Zur Theorie des

Vaterschaftsnachweises auf Grund von Ähnlichkeitsbefunden. Zeitschr. f. gerichtl.
Medizin, 31 (1939) p. 70.





205. Radon, J. Theorie und Anwendung der absolut additiven

Mengenfunktionen. Sitzungsber. Akad. Wien, 122 (1913) p. 1295.

206. Rasch, G., and Hald, A. Nogle Anvendelser af Transformationsmetoden i

den normale Fordelings Teori. Festskrift til Prof. J. F. Steffensen,
København 1943, p. 52.

207. Reiersol, O. Confluence analysis by means of instrumental sets of variables. 

Arkiv för matematik etc., 32 A (1945) no. 4.

208. Romanovsky, V. Sur certaines espérances mathématiques et sur l'erreur

moyenne du coefficient de corrélation. CR, 180 (1925) p. 1897.

209. - On the moments of standard deviations and of correlation coefficient

in samples from normal population. M, 5 no. 4 (1925) p. 3. 

210. - On the distribution of the regression coefficient in samples from

normal population. Bull. Acad. Sc. Leningrad, 20 (1926) p. 643.

211. Sheppard, W. F. On the application of the theory of error to cases of normal

distribution and normal correlation. PTRS, 192 (1898) p. 101. 

212. - On the calculation of the most probable values of frequency-constants 

for data arranged according to equidistant divisions of a scale. Proc. 
London Math. Soc., 29 (1898) p. 353. 

213. - The fit of a formula for discrepant observations. PTRS, 228 (1929). 

213a. Simonsen, W. On distributions of functions of samples from a normally 

distributed infinite population. SA, 1944, p. 235, and 1945, p. 20.

214. Slutsky, E. Über stochastische Asymptoten und Grenzwerte. M, 5 no. 3

(1925) p. 3. 

215. Smirnoff, N. Sur la distribution de ω². CR, 202 (1936) p. 449.

216. Soper, H. E., and others. On the distribution of the correlation coefficient
in small samples. B, 11 (1917) p. 328. 

217. Steffensen, J. F. Factorial moments and discontinuous

frequency-functions. SA, 1923, p. 73.

218. - Free functions and the Student-Fisher theorem. SA, 1936, p. 108.

220. Stieltjes, T. J. Extrait d'une lettre adressée à M. Hermite. Bull. Sc.

Math., 2:e série, 13 (1889) p. 170.

221. »Student.» The probable error of a mean. B, 6 (1908) p. 1. 

222. - Probable error of a correlation coefficient. B, 6 (1908) p. 302. 

223. Sukhatme, P. V. On Fisher and Behrens' test of significance for the

difference in means of two normal samples. Sankhyā, 4 (1938) p. 39.

224. Tang, P. C. The power function of the analysis of variance tests with tables

and illustrations of their use. Statist. Research Mem., 2 (1938) p. 126.

225. Thompson, W. R. On a criterion for the rejection of observations and the 

distribution of the ratio of deviation to sample standard deviation. 
AMS, 6 (1935) p. 214. 

226. Tippett, L. H. C. On the extreme individuals and the range of samples

taken from a normal population. B, 17 (1925) p. 364. 

- see Fisher, R. A.

Tocher, J. F., see Pearson, K. 

227. Tschuprow, A. A. On the mathematical expectation of the moments of 

frequency distributions. B, 12 (1919) p. 140 and 185, and B, 13 (1921) 
p. 283.

227a. - Zur Theorie der Stabilität statistischer Reihen. SA, 1918 and 1919.




228. Wahlund, S. Demographic studies in the nomadic and the settled

population of northern Lapland. Uppsala 1932.

229. Welch, B. L. The significance of the difference between two means when 

the population variances are unequal. B, 29 (1938) p. 350. 

230. Wicksell, S. D. On the genetic theory of frequency. Arkiv för matematik

etc., 12 (1917) no. 20. 

231. - Sex proportion and parental age. Kungl. Fysiogr. Sällsk. Handl.,

37 (1926) no. 6.

232. Wilks, S. S. Certain generalizations in the analysis of variance. B, 24 

(1932) p. 471. 

233. - Shortest average confidence intervals from large samples. AMS, 9 

(1938) p. 106. 

234. - Fiducial distributions in fiducial inference. AMS, 9, (1938) p. 272. 

Williams, C. B., see Fisher, R. A. 

240. Wishart, J. The generalized product moment distribution in samples from

a normal multivariate population. B, 20 A (1928) p. 32. 

241. Wishart, J., and Bartlett, M. S. The generalized product moment

distribution in a normal system. PCPS, 29 (1932) p. 260.

245. Wold, H. Sulla correzione di Sheppard. Giorn. Ist. Italiano d. Attuari,

5 (1934) p. 304. 

246. - Sheppard’s correction formulae in several variables. SA, 1934, p. 248. 

246a. - A study in the analysis of stationary time series. Diss. Stockholm,

Uppsala 1938. 

247. - Efterfrågan på jordbruksprodukter och dess känslighet för pris- och

inkomstförändringar. Statens Off. Utr., 1940 no. 16.

248. - A theorem on regression coefficients obtained from successively

extended sets of variables. SA, 1945, p. 181.

250. Yates, F. Contingency tables involving small numbers and the χ² test. JRS,

Suppl. 1 (1934) p. 217. 

251. Yule, G. U. On the theory of correlation for any number of variables treated 

by a new system of notation. PRS 79 (1907) p. 182. 

- see also Greenwood, M. 

III. Tables. 

260. British Association, Mathematical Tables, Vol. 7: Tables of the
Probability Integral by W. F. Sheppard (1939).

261. David, F. N. Tables of the Correlation Coefficient. London 1938. 

262. Fisher, R. A., and Yates, F. Statistical Tables. Second ed., Edinburgh
and London 1943. 

263. Kendall, M. G., and Babington Smith, B. Tables of Random Sampling
Numbers. Tracts for computers, no. 24, 1940. 

264. Pearson, K. Tables for Statisticians and Biometricians. I, second ed.,

1924. II, 1931. 

265. - Tables of the Incomplete Γ-function, 1922.

266. - Tables of the Incomplete B-function. 

267. Tippett, L. H. C. Random Sampling Numbers. Tracts for computers, no. 

15, 1927. 





Index. 


Addition theorem, 196, 206, 212, 234, 379 
Additive class of sets, 10, 30 
Age of parents, 467 
Almost everywhere, 40 
Analysis, 145, 836 
of variance, 242, 630 
of covariance, 647 
Area, 76 
Association, 443 
Asymptotically normal, 214 
Axiom, 146, 162, 163, 155 

Bayes' theorem, 607, 608 
Bernoulli's numbers, 123
theorem, 196 

Beta distribution, 243, 252 
function, 126
Bias, 361, 386, 478 

Bienaymé-Tchebycheff's inequality, 183
Binomial distribution, 193, 487
Bolzano-Weierstrass' theorem, 12 
Borel sets, 13, 16, 30, 32, 33, 162 
Borel's lemma, 19
Breadth of beans, 329, 440 

Cauchy's distribution, 246, 373, 490
Central limit theorem, 213, 316 
Centre of gravity, 71, 178 
C. f., 185 

Change of variables, 202 
Characteristic function, 89, 100, 185, 265,
296 

number, 112 

Chi-square (χ²) distribution, 233, 250
minimum method, 425, 606
test, 416, 424

Coefficient of correlation, 277, 295, 369,
308, 411 

Coefficient of correlation, multiple, 307, 413 
» » » partial, 306, 316,


Coefficient of correlation, total, 306 
» » divergence, 447 

of excess, 184, 187, 356, 386 
of regression, 273, 302, 316, 402, 410 
of skewness, 184, 187, 366, 386 
of variation, 367 
Cofactor, 109 

Combined experiment, 156 
variable, 164 
Composition, 190 
Concentration, 179, 284, 300, 493 
Confidence coefficient, 612
interval, 462, 507, 600, 517
level, 512 

limits, 462, 407, 612 
region, 607, 617 
Confluence analysis, 656 
Contingency table, 441 
Continuity interval, 58, 80 
theorem, 96, 102 
Continuous type, 169, 261, 292
Convergence in probability, 262, 299 
Correlated variables, 278 
Correlation, 265, 277, 301 
ratio, 280 

Covariance, 263, 295 
Cumulants, 186 

Darboux sums, 34, 62, 72, 86 
Death rates, 440 
Decile, 181 

Degrees of freedom, 234, 230, 381 
De Moivre's theorem, 198 
Dependence, 282, 443 
Description, 145, 335
Determinant, 108 
D. f., 166

Dirichlet's integrals, 121
Discrete mass point, 67, 81
Discrete type, 168, 260, 201 
Dispersion, 179


411 





Distance, 16 

Distribution, 66, 80, 168 
bimodal, 179 

conditional, 167, 164, 267, 292 
joint, 166, 164 
marginal, 82, 166, 260, 291 
multimodal, 179 
non-singular, 297 
of a sample, 826 
simultaneous, 166 
singular, 297 
truncated, 247 
unimodal, 179

Distribution function, 67, 80, 164, 106,
260, 291 
D. of fr., 381 

Edgeworth's series, 228
Efficiency, 481, 487 
asymptotic, 489, 494 
Element of a matrix, 103 
of a set, 3

Ellipse of concentration, 283, 493 
of inertia, 276 

Ellipsoid of concentration, 300, 496 
Equiprobability curves, 288
Estimate, 320, 838, 473
asymptotically efficient, 489, 404 
consistent, 861, 489 
efficient, 477, 481, 487, 493, 496, 497 
regular, 479, 480 
sufficient, 488, 492, 497 
unbiased, 361 
Estimation, 473 

Euler-MacLaurin's sum formula, 123
Euler's constant, 125 
Excess, 184, 230 
Expectation, 170 
Extreme values, 870 

Factor analysis, 655 
Fisher's lemma, 379
z-distribution, 241
Form, bilinear, 107 
definite positive, 114 
non-negative, 114 
quadratic, 107 
reciprocal, 111 
semi-definite, 114


Fourier integrals, 88
Frequency curve, 170 
function, 67, 82, 167, 292
interpretation, 149, 332 
ratio, 142 
Fr. f., 107 

Function, B-measurable, 67, 85
characteristic, 80, 100, 185, 266, 296
complex-valued, 66 
generating, 267 
integrable, 86, 87, 43, 46, 85 

Gamma function, 125 
Generating function, 257 
Gram-Charlier's series, 222, 268 
Grouping, 328, 360 

Hair colours, 446 
Hellinger's integral, 487
Hermite's polynomials, 133, 222 
Histogram, 829 
Hyperplane, 10 
Hypersurface, 10 
Hypothesis, admissible, 628 
composite, 628, 634 
simple, 628 

Income distributions, 220, 248, 444, 448
Independent events, 160 
repetitions, 101 
trials, 206 

variables, 169, 164, 187, 188, 285, 316
Inner point, 12, 10 
Interval, 10, 11, 16 

Khintchine's theorem, 263 

Laguerre's polynomials, 133 
Laplace's distribution, 247, 378 
Latin squares, 646 
Least squares, 182, 664 
Lebesgue's measure, 19, 22, 29, 32, 48, 76
integral, 33, 47, 48, 86
Lebesgue-Stieltjes' integral, 49, 62, 66, 70, 86
Length, 19, 76 
Liapounoff's theorem, 215 
inequality, 256 
Likelihood equation, 499 
function, 478, 498 





Lindeberg-Lévy's theorem, 216, 286, 316
Linear equations, 111 
Location, 177 

Logarithmico-normal distribution, 220, 268

Main diagonal, 106 
Matrix, 103 
adjugate, 110
correlation, 296 
diagonal, 105 
moment, 264, 205 
non-singular, 109 
orthogonal, 112 
reciprocal, 110 
singular, 100 
square, 105 
symmetric, 105 
transformation, 107 
unit, 105

zero, 106 

Maximum likelihood estimate, 490, 600 
» » method, 420, 498 

Mean, 178

» deviation, 181 
» square contingency, 282, 443 
» » regression, 272, 302

» value, 170, 262, 294 
» » conditional, 267, 269, 302 

Measurable sets, 29, 30, 32 
Measure, 19, 22 
inner, 26 
outer, 26 
Median, 178, 360 
Mendel's experiments, 422 
Method of moments, 497 
Minor, 109 
principal, 109
Mode, 179 

Moment, 71, 86, 174, 204, 346, 364 
absolute, 174 

central, 176, 263, 295, 349, 365 

factorial, 267 

inertial, 71, 180 

mixed, 263 

product, 263 

raw, 360 

Multinomial distribution, 318, 418 


Negative binomial distribution, 259, 437 
Neighbourhood, 12, 16 
Normal distribution, 100, 208, 287, 310, 
373, 437, 483, 490, 504, 614, 618, 620, 
633, 535 

Normal distribution, conditional, 814
» » non-singular, 311

» » singular, 312 

Null hypothesis, 337 
Number of children, 444 

Operations with matrices, 104 
with sets, 4 

Orthogonal polynomials, 131, 276 
transformations, 113 

Parametric space, 616 
Pareto's distribution, 248
Partial integration, 74 
Pearson's system, 248 
P-measure, 64, 66, 77

Point function, 49 

Poisson's distribution, 203, 260, 319, 434, 
487 

Population, 144, 324 
Postmultiplication, 104
Prediction, 146, 338 
Premultiplication, 104
Pr. f., 166 
Prices, 460 

Principal diagonal, 106 
Product space. It, 83 
Probability, 148, 164
a posteriori, 608 
a priori, 608 
conditional, 168, 164 
fiducial, 607 
inverse, 338, 607, 614 
Probability density, 57, 82, 167, 292 
distribution, 164, 164, 166
element, 167

function, 67, 80, 164, 166, 200, 291 

Quantile, 181, 367 
Quartile, 181

Rainfall, 468 





Random experiment, 137 
process, 437, 472 
sampling, 323 
sampling numbers, 830 
variable, 161, 164 
Randomized blocks, 543 


Range, 181, 370 
Rank, 109, 297 

Rectangular distribution, 244, 372 
Reduction of data, 336
Regression, 270, 801, 648, 666
coefficient, 278, 802, 816, 402, 410
curve, 270, 272 

linear, 271, 272, 302, 316, 549, 661 
orthogonal, 276, 809 
parabolic, 276, 5U4 
plane, 302, 316 
surface, SOI 
Repeated integrals, 87 
Representative method, 331, 616, 523 


Residual, 305

Riemann's integral, 36, 88 

Riemann-Stieltjes' integral, 71, 88



Sample, 144, 824 
characteristics, 341 
grouped, 828, 869 
mean, 846 
point, 383, 496 
space, 388, 478, 496, 610 
values, 324 
Sampling, 144 
biased, 380 
distribution, 827
with replacement, 381 
without replacement, 831, 616, 623 
Scatter coefficient, 301, 411 
Schwarz' inequality, 88 
S. d., 180 

Secular equation, 118 
Semi-interquartile range, 181 
Semi-invariant, 186, 192, 296, 849 
Sequence of distributions, 68, 82
of functions, 40, 46
of sets, 7, 9, 83

Set, 8 

bounded, 12, 16 
complementary, 7 



critical, 629 
cylinder, 17 
empty, 4 

enumerable, 8, 26 
linear, 10, 16 
rectangle, 18 

Set function, 22, 47, 48, 49, 66, 76, 78, 162 
Sex distribution, 447, 467 
Sheppard’s corrections, 361 
Significance level, 834, 420 
limit, 834 

Significant deviations, 420, 421 
Skewness, 188, 230 
Space, 4 

Standard deviation, 180 
error, 463 

Standardized variable, 180, 186 
Statistical hypotheses, 833, 625 
regularity, 141, 156 
Statures, 461
Step-function, 68, 66 
Stirling's formula, 128 
Student's distribution, 237, 261 
ratio, 887 

» generalized, 407 
Submatrix, 106
Subscript, primary, 303 
secondary, 303 
Subset, 4
Subspace, 17 
Sum polygon, 328

Tchebycheff's theorem, 182, 268 
t distribution, 289 
non-central, 318 

Temperatures, 828, 489, 466, 468 
Test, most powerful, 629
unbiased, 581, 636 
uniformly most powerful, 680, 686 
Test of goodness of fit, 834, 410, 460 
of homogeneity, 446 
of significance, 884, 886, 416, 462, 626
Tetrad relation, 566 
Time series, 472 

Transformation, contragredient, 111 
linear, 107, 279, 298, 812 
orthogonal, 118 
Transpose, 106 





Triangular distribution, 246, 372
inequality, 16 

Uncorrelated variables, 278, 296

Uniqueness theorem, 93, 101 

Variance, 180, 203, 295, 347 
conditional, 267, 269 
generalized, 301, 406 


ratio, 530 
residual, 278, 805 
Vector, 106 
Volume, 70 

Water levels, 468
Wheat yields, 468 

Yates' correction, 446 




